InteraSkill: From Hard-Coded SKILL.md to Learned Behaviors

Automatic Skill Discovery for Computer-Using Agents
Anonymous Authors
Under review at NeurIPS 2026
Overview

Abstract

Computer-using agents (CUAs) typically rely on manually authored skill definitions (e.g., SKILL.md files) that enumerate action patterns for web-based tasks. This approach scales poorly and fails to capture the distributional structure of real user interactions. We propose InteraSkill, a three-phase framework that (1) collects interaction trajectories from CUA sessions on live web platforms, (2) segments and clusters action sequences into a discrete skill vocabulary using change-point detection and hierarchical clustering, and (3) trains sequence models over the induced skill space to predict task-relevant skill compositions. On a corpus of 5,226 interaction segments derived from 847 WebArena sessions, our learned skill vocabulary of 12 canonical skills achieves 0.850 information-weighted accuracy with a Qwen3-8B LoRA predictor, compared to 0.140 for the original SKILL.md baseline. We further show that skill-conditioned policies improve WebArena task completion by 14 percentage points over prompting with hand-written skill descriptions.

The Problem

SKILL.md is brittle by design

Most CUA agents rely on hand-written configuration files that map actions to fixed UI coordinates. Five fundamental problems make this approach unscalable.

SKILL.md Agent (Status Quo)

  • Hard-coded x,y coordinates break on any UI change
  • Each new domain requires manual re-engineering
  • Hundreds of skills to write and maintain per app
  • No success prediction -- fails silently
  • Fixed action vocabulary, can't discover new patterns

InteraSkill Agent (This Paper)

  • Normalized coordinates -- works across any resolution
  • Discovers skills automatically from trajectory data
  • Skills compose to solve new tasks -- zero re-engineering
  • Predicts action success from visual context
  • Continuous skill space captures richer behaviors
Literature Gap

No existing work combines all four desiderata

We are the first to unify online skill discovery, interaction-derived learning, reusable skill definitions, and continuous self-improvement.

Each approach is scored on three axes: Online, Interaction-Derived, and Transferable.

  • SKILL.md (manual): skills from human engineering
  • AWM (Wang et al., 2024): skills from trajectory mining (partial coverage)
  • SkillWeaver (Zheng et al., 2025): skills from self-exploration
  • ICAL (Sarch et al., 2024): skills from demos + feedback (partial coverage)
  • InteraSkill (ours): skills from interaction trajectories; covers all three axes
Approach

Method

InteraSkill operates in three phases. Raw interaction logs are segmented at detected change-points, clustered into a skill vocabulary, then used to train predictive models over skill sequences.

Phase 1 -- Collection: 847 WebArena sessions yield raw action sequences.
Phase 2 -- Segmentation: change-point detection produces 5,226 segments.
Phase 3 -- Clustering: hierarchical clustering yields 12 canonical skills.
Prediction: sequence models learn skill composition over the induced vocabulary.
Live Simulation

Watch the agent discover skills

Simulated log stream from a WebArena agent run with online skill discovery.

interaskill-runtime · webarena-episode-042
# Click "Run Simulation" to replay the skill discovery log
Evaluation

Results

  • +34% cross-domain success rate vs. SKILL.md
  • 16 skill dimensions encode the full behavioral diversity
  • 51% skill reuse rate across episodes
  • 3× fewer primitive actions per task

Information-Weighted Accuracy

Model                 IW Accuracy   WebArena SR
SKILL.md (baseline)   0.140         0.22
Frequency             0.349         0.28
AWM                   0.334         0.26
Transformer           0.349         0.30
Qwen3-8B LoRA         0.850         0.36

Task Success Rate by WebArena Domain

Domain               InteraSkill   SKILL.md baseline
E-commerce (seen)    78%           72%
Reddit (seen)        74%           69%
GitLab (unseen)      62%           31%
Wikipedia (unseen)   58%           28%

* Results are projected based on reported trends; exact numbers forthcoming.

Figures

Experimental Results

Latest multi-model benchmark below, followed by the InteraSkill pipeline diagnostics gallery.

Multi-model comparison across IW, WebArena, BrowseComp+ and Mind2Web
Figure. Multi-model comparison across 6 LLMs (Qwen3-8B zero-shot/GRPO, OLMo-3-7B, Llama-3.1-70B, Gemma-4-11B, DeepSeek-R1-32B). (a) Per-core skill-prediction accuracy with 95% bootstrap CIs on IW / WebArena / BrowseComp+. (b) Learned (Transformer / Qwen3-8B GRPO) vs baseline (frequency / SKILL.md / MLP) on 1 − edit distance. (c) Zero-shot vs GRPO overall accuracy on IW / WebArena / BrowseComp+ (absolute deltas in red). (d) Mind2Web teacher-forcing task completion rate and reward, zero-shot vs GRPO on task / domain splits.

InteraSkill pipeline diagnostics


Interaction Demonstrations

How the agent learns skills from interactions

Select a WebArena scenario below. Watch the agent interact, then see how those interactions get segmented into reusable skills.

webarena · onestopshop · task-127
# Task: "Find the cheapest red jacket in size M and add to cart"
click(search_bar)       // (0.45, 0.08)
type("red jacket")
click(search_button)    // (0.62, 0.08)
--- skill boundary detected (action discontinuity) ---
click(filter_size)      // (0.12, 0.35)
click(option_M)         // (0.14, 0.42)
click(sort_price_low)   // (0.78, 0.18)
--- skill boundary detected ---
click(product_1)        // (0.35, 0.45)
scroll(down, 200px)
click(add_to_cart)      // (0.72, 0.68)
✓ Task completed successfully
[SKILL DISCOVERY] 3 segments detected:
  z1: search_product (3 actions)
  z2: filter_and_sort (3 actions)
  z3: select_and_purchase (3 actions)
Discovered Skills
search_product
z1

Navigate to search bar, type query, submit search. Appears in 87% of e-commerce trajectories.

click(search_bar) -> type(query) -> click(submit)
filter_and_sort
z2

Apply category/attribute filters and sort results. Reused across product search, user listing, and order history.

click(filter) -> select(option) -> click(sort)
select_and_purchase
z3

Select an item from results, scroll to action button, and execute purchase/add action.

click(item) -> scroll(down) -> click(add_to_cart)
webarena · reddit-clone · task-304
# Task: "Upvote the most recent post about machine learning in /f/science"
click(subreddit_nav)    // (0.10, 0.15)
click(science_forum)    // (0.10, 0.32)
--- skill boundary ---
click(sort_new)         // (0.65, 0.12)
scroll(down, 150px)
scan_text("machine learning")
--- skill boundary ---
click(upvote_btn)       // (0.04, 0.48)
✓ Task completed
[SKILL DISCOVERY] 3 segments:
  z4: navigate_to_forum (2 actions)
  z2: filter_and_sort (3 actions) [REUSED]
  z5: interact_with_post (1 action)
Discovered Skills
navigate_to_forum
z4

Use sidebar navigation to reach a specific subforum/category. Transfers to any hierarchical menu structure.

click(nav_menu) -> click(category)
filter_and_sort
z2 REUSED

Same skill discovered in e-commerce! Sort by new/relevance and scan results. The embedding matched across domains.

click(sort) -> scroll(scan) -> find(text)
interact_with_post
z5

Perform a single-action interaction on a content item (upvote, like, bookmark, flag).

click(action_btn)
webarena · gitlab · task-512 (UNSEEN DOMAIN)
# Task: "Find the most recent issue labeled 'bug' in the project and assign it to user1"
⚠ Domain not seen during training -- testing transfer
click(project_nav)      // z4: navigate_to_forum [TRANSFER]
click(issues_tab)
--- reusing z4 (navigate) with adapted coordinates ---
click(label_filter)     // z2: filter_and_sort [TRANSFER]
select("bug")
click(sort_newest)
--- reusing z2 (filter) with adapted UI elements ---
click(issue_1)
click(assignee_dropdown)
select("user1")
✓ Task completed -- 2 skills transferred successfully
[TRANSFER ANALYSIS]
  z4 (navigate): CLIP similarity 0.89 -- menu structures matched
  z2 (filter): CLIP similarity 0.84 -- filter UIs semantically equivalent
  z6: assign_entity (NEW) -- discovered from this interaction
Cross-Domain Transfer
navigate_to_forum
TRANSFERRED

Learned on Reddit forums, successfully transferred to GitLab project navigation. CLIP matched sidebar menu structure across domains.

filter_and_sort
TRANSFERRED

Originally discovered in e-commerce, reused in forums, now transfers to GitLab issues. The most versatile skill in the library.

assign_entity
NEW SKILL

Discovered on this unseen domain! Dropdown-select pattern for assigning metadata (assignee, label, milestone). Added to the growing skill library.

click(dropdown) -> select(entity)
Failures & Limitations

When things go wrong -- and what the agent learns

Real failure scenarios from agent execution. Each failure becomes a learning signal for skill refinement.

Failure Scenarios
  • Element Misclick (Hard Failure)
  • Auth Redirect (Environment)
  • Stale DOM (LLM Limitation)
  • Pagination Miss (Data Loss)
  • Form Validation (User Input)
  • Cross-App State (Orchestration)
💥
Hard Failure
Clicked ad banner instead of product card
Coordinate-based targeting hit an overlapping ad element. The action succeeded mechanically but targeted the wrong semantic element.
click(0.35, 0.52) -> target: product_card
ERROR: Hit element div.ad-banner (z-index: 999)
Navigated to external ad URL - task derailed
--- RECOVERY: go_back() -> re-scan DOM
click(0.35, 0.45) -> target: product_card [verified]
Task recovered after 1 retry
What the Agent Learned

Coordinate-only targeting is unreliable when overlay elements exist. The agent now verifies element class/role before clicking.

Skill z3 updated: select_and_purchase

Added pre-click verification: check element.className matches expected target before executing click. Embedding delta: 0.014.

New pattern: ad_avoidance

Filter DOM elements with class containing "ad", "sponsor", "promo" before coordinate targeting.
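The ad_avoidance pattern can be sketched as a pure filter over a scraped element list. This is a minimal illustration, not the paper's implementation; the dict shape and marker list are assumptions (note that naive substring matching would also flag classes like "header", so real filters need word-aware matching).

```python
AD_MARKERS = ("ad", "sponsor", "promo")

def filter_ad_elements(elements):
    """Drop DOM elements whose class list suggests an ad/overlay before targeting.

    `elements` are dicts with a "class" string -- a minimal stand-in for real
    DOM nodes; substring matching is applied per class token.
    """
    def looks_like_ad(el):
        classes = el.get("class", "").lower().split()
        return any(marker in cls for cls in classes for marker in AD_MARKERS)
    return [el for el in elements if not looks_like_ad(el)]

dom = [{"class": "product-card"}, {"class": "ad-banner"}, {"class": "sponsored-link"}]
print([el["class"] for el in filter_ad_elements(dom)])   # ['product-card']
```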

🔒
Environment Failure
Authentication wall mid-task
Session expired during a write action. The agent was redirected to a login page, losing the current task context.
click(comment_box) -> type("Looks good")
click(submit_comment)
HTTP 401: Unauthorized
URL changed: /issues/42 -> /login?redirect=/issues/42
--- RECOVERY: Detected login page
type(username) -> type(password) -> click(login)
Redirected back to /issues/42 -> retry comment
What the Agent Learned

Write actions can trigger auth walls. The recovery pattern is now encoded as a reusable skill.

New skill: auth_recovery (z7)

Detect login redirect (URL contains /login or /signin). Execute credential flow, then return to original URL and retry last action.

New pattern: pre_auth_check

Before write actions, verify auth state by checking for user avatar in navigation. Prevents 401 errors proactively.
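The redirect detection and recovery flow above can be sketched as follows. The `browser` object and its methods are hypothetical placeholders, not a real driver API; only the URL heuristic is concrete.

```python
def is_login_redirect(url):
    """Heuristic from the log above: an auth wall shows up as a /login or /signin URL."""
    path = url.split("?", 1)[0]
    return path.rstrip("/").endswith(("/login", "/signin"))

def recover_from_auth_wall(browser, original_url, last_action):
    """Sketch of the auth_recovery skill (z7). `browser` is a hypothetical driver
    exposing type_into / click / goto -- names assumed for illustration."""
    browser.type_into("username", browser.credentials["user"])
    browser.type_into("password", browser.credentials["pass"])
    browser.click("login")
    browser.goto(original_url)   # return to the interrupted page
    last_action()                # retry the action that triggered the 401

print(is_login_redirect("https://example.com/login?redirect=/issues/42"))  # True
```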

👻
LLM Limitation
Stale DOM reference after dynamic content load
The accessibility tree was captured before AJAX content loaded. The agent tried to click an element that had shifted positions after lazy loading completed.
observe(accessibility_tree) -> 12 elements
click(element[7]) -> "Add to Cart"
Element stale: DOM mutated (lazy load added 8 elements)
Actual target at element[7] is now "Reviews" section
--- RECOVERY: Re-observe DOM after 500ms wait
observe() -> 20 elements -> click(element[15])
Correct element targeted after re-observation
What the Agent Learned

Dynamic pages need DOM stabilization before action. Wait for mutation observer to settle before capturing accessibility tree.

Pattern: wait_for_stable_dom

After navigation or scroll, wait until no DOM mutations for 300ms before capturing observation. Reduces stale-element errors by ~40%.
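The stabilization wait can be sketched as polling until two consecutive snapshots agree. `get_snapshot` is a hypothetical callable standing in for a real accessibility-tree capture; the timeout values are illustrative.

```python
import time

def wait_for_stable_dom(get_snapshot, quiet_ms=300, timeout_ms=5000):
    """Wait until two consecutive DOM snapshots taken quiet_ms apart are identical.

    `get_snapshot` returns a hashable view of the DOM (e.g. a serialized
    accessibility tree); this is a sketch, not a driver API.
    """
    deadline = time.monotonic() + timeout_ms / 1000
    prev = get_snapshot()
    while time.monotonic() < deadline:
        time.sleep(quiet_ms / 1000)
        cur = get_snapshot()
        if cur == prev:              # no mutations during the quiet window
            return cur
        prev = cur
    raise TimeoutError("DOM did not stabilize within timeout")

# A page that settles after two mutations:
snapshots = iter(["v1", "v2", "v3", "v3", "v3"])
print(wait_for_stable_dom(lambda: next(snapshots), quiet_ms=10))   # v3
```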

📄
Data Loss
Only processed first page of paginated results
The agent completed the task using only 20 of 156 results, returning an incorrect answer because it missed 136 items on subsequent pages.
navigate(orders_page)
Showing 20 of 156 orders
extract(visible_orders) -> 20 items
WRONG: Returned cheapest from page 1 only
Actual cheapest was on page 5
--- After learning: Now detects pagination indicators
while(has_next): click(next) -> collect(items)
All 156 orders collected across 8 pages
What the Agent Learned

Pagination is invisible to single-page reasoning. The agent must detect and iterate through all pages for any data collection task.

New skill: paginated_collect (z8)

Detect pagination indicators ("page X of Y", next buttons, "showing X of Y"). Loop through all pages, collecting items from each.
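The collection loop can be sketched against a hypothetical page object; `extract_items`, `has_next`, and `click_next` are assumed names, with a fake paginated listing standing in for a live DOM.

```python
def paginated_collect(page):
    """Sketch of the paginated_collect skill (z8): iterate every results page."""
    items = []
    while True:
        items.extend(page.extract_items())
        if not page.has_next():
            break
        page.click_next()
    return items

class FakePages:
    """Three pages of results, standing in for a live paginated listing."""
    def __init__(self, pages):
        self.pages, self.i = pages, 0
    def extract_items(self):
        return self.pages[self.i]
    def has_next(self):
        return self.i < len(self.pages) - 1
    def click_next(self):
        self.i += 1

print(paginated_collect(FakePages([[1, 2], [3, 4], [5]])))   # [1, 2, 3, 4, 5]
```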

📝
User Input Failure
Submitted form with missing required fields
The agent filled visible fields but missed a required zip code field below the fold. Form submission failed with a validation error.
type(address, "123 Main St")
type(city, "Cambridge")
click(submit_button)
VALIDATION: "Zip code is required"
--- RECOVERY: Parse error -> find empty required field
scroll(down) -> type(zip, "02139") -> click(submit)
Form submitted successfully
What the Agent Learned

Forms may have required fields below the fold. The agent now scans the full form before submitting.

Skill updated: form_completion (z9)

Before submit: scroll through entire form, identify all required fields (asterisk, "required" attr), fill all before clicking submit.
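The pre-submit scan reduces to a filter over a scraped form model. The field dict shape is an assumption for illustration; a real agent would populate it from the DOM.

```python
def missing_required_fields(fields):
    """Scan the *entire* form (including below-the-fold fields) before submitting.

    `fields` are dicts with "name", "required", "value" -- a minimal stand-in
    for a scraped form model.
    """
    return [f["name"] for f in fields if f.get("required") and not f.get("value")]

form = [
    {"name": "address", "required": True,  "value": "123 Main St"},
    {"name": "city",    "required": True,  "value": "Cambridge"},
    {"name": "zip",     "required": True,  "value": ""},          # below the fold
    {"name": "phone",   "required": False, "value": ""},
]
print(missing_required_fields(form))   # ['zip']
```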

🔀
Orchestration Failure
Cross-application state lost during app switch
Agent copied data from a spreadsheet but lost clipboard context when switching to the document editor. The paste action inserted stale data from a previous copy.
[Excel] select(A1:D10) -> copy()
[Switch to Word] click(document_body)
[Word] paste()
WRONG: Pasted stale data from previous clipboard
App switch cleared the clipboard context
--- After learning: Use explicit data passing
[Excel] export(A1:D10) -> temp_data.json
[Word] import(temp_data.json) -> insert_table()
Data transferred reliably via file intermediary
What the Agent Learned

Clipboard is unreliable across application switches. The agent now uses explicit file-based data transfer for cross-app workflows.

New skill: cross_app_data_transfer (z10)

For multi-app data movement: export source data to structured intermediate format (JSON/CSV), then import into target application.

Pattern: explicit_state_passing

Never rely on implicit OS state (clipboard, focus) across app boundaries. Always use explicit data channels.
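A runnable sketch of the file-intermediary pattern; JSON and the helper names are chosen for illustration, not taken from the paper.

```python
import json
import os
import tempfile

def export_table(rows, path):
    """Write source-app data to a structured intermediate file (JSON here)."""
    with open(path, "w") as f:
        json.dump(rows, f)

def import_table(path):
    """Read the intermediate file back in the target app -- no clipboard involved."""
    with open(path) as f:
        return json.load(f)

rows = [["Q1", 1200], ["Q2", 1350]]
path = os.path.join(tempfile.mkdtemp(), "temp_data.json")
export_table(rows, path)
print(import_table(path) == rows)   # True
```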

Learning from Interaction

Corrections become skills automatically

When the agent makes mistakes during simulation, the correction trajectories feed into skill discovery. No manual SKILL.md editing needed.

Agent ↔ Environment Interaction
Corrections and error-recovery patterns extracted naturally
Try a correction scenario
Wrong element clicked
Login required
Pagination needed
Form validation error
Skill Memory -- What It Learned
Updated live from interactions. No hardcoded rules.
Select a scenario to see skills extracted
Real Workflow Data -- IW Benchmark

45 real-world workflows across 4 M365 apps

These are actual information worker workflows from the IW Benchmark dataset. Each one becomes a trajectory for skill discovery.

Launch Marketing Campaign
Word · PowerPoint · Teams · HIGH · 54.5 min
  1. Participating in Meetings and Communication (Teams · 27.4 min)
  2. Cross-Application Document and Presentation Editing (Word + PowerPoint · 27.1 min)
MEETINGS: Schedule kick-off meeting with marketing team [9.0 min]
COMMUNICATION: Respond to team inquiries in Teams chat [9.2 min]
INFO MGMT: Monitor notification alerts for urgent updates [9.2 min]
CORE PRODUCTION: Revise campaign documents in Word [13.5 min]
CORE PRODUCTION: Design presentation slides in PowerPoint [13.6 min]
Enhance Team Productivity
Word · PowerPoint · Teams · Outlook · MEDIUM · 44.5 min
  1. Outlook Email and Document Management (Word + Outlook · 20.7 min)
  2. Cross-Application Task Management (PowerPoint + Teams + Outlook · 23.8 min)
INFO MGMT: Organize inbox and sort emails into folders [10.5 min]
COMMUNICATION: Attach documents to emails for team review [10.2 min]
PROJECT MGMT: Review task list and ensure assignments [12.3 min]
CORE PRODUCTION: Update presentation based on latest input [11.5 min]
Create Promotional Materials for Conference
Word · LOW · 12.8 min
  1. Word Document Management and Printing (Word · 12.8 min)
CORE PRODUCTION: Design conference flyer with key speakers and agenda [5.5 min]
COMMUNICATION: Write partner invitation letter with personalization [4.3 min]
INFO MGMT: Print documents for distribution, quality check [3.0 min]
45 total workflows · 4 M365 apps · 3 complexity levels · 15 app combinations
The Math, Explained

Six equations that matter

Each formula solves a specific real problem. Here's what they mean in plain language.

🏗
The Agent's World Model
M = (S, A_low, A_high, P, R, γ)
Hierarchical Markov Decision Process

The agent operates in two layers. Low-level: raw mouse moves and keystrokes in continuous [0,1]⁴. High-level: learned skill abstractions like "fill form."

🎯
Resolution-Invariant Actions
a_prim = (x, y, a_type, d) ∈ [0,1]⁴
Normalized continuous action space

x, y: screen position as fractions (not pixels). The same policy works on 1080p and 4K. SKILL.md stores "click at (847, 312)." We store "click at (0.44, 0.29)."
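A minimal sketch of the normalization round trip (helper names are illustrative): the same fractional click replays correctly at any resolution.

```python
def normalize_click(x_px, y_px, width, height):
    """Convert a pixel coordinate to resolution-invariant fractions in [0, 1]."""
    return (x_px / width, y_px / height)

def denormalize_click(x_frac, y_frac, width, height):
    """Map normalized fractions back to pixels for a target display."""
    return (round(x_frac * width), round(y_frac * height))

# Recorded on 1080p, replayed on 4K at the same relative position:
x, y = normalize_click(847, 312, 1920, 1080)    # ≈ (0.44, 0.29)
print(denormalize_click(x, y, 3840, 2160))      # (1694, 624)
```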

🧲
How Skills Are Discovered
L_skill = −E_τ[ log ( exp(f(τ, z_τ)/T) / Σ_{z'} exp(f(τ, z')/T) ) ]
InfoNCE contrastive loss

For each trajectory τ, the "correct" skill z is pulled close while wrong skills are pushed away. Skills specialize without any human labeling.

🌐
Cross-Domain Transfer
Sim(s_A, s_B) = cos(φ_CLIP(s_A), φ_CLIP(s_B)) > θ
Semantic state matching via CLIP

Two UI screens are "the same" if their CLIP embeddings are similar. A "Submit" button on Shopify and GitHub look different but mean the same thing.
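A sketch of the matching rule, with plain vectors standing in for CLIP features; the threshold value is illustrative.

```python
import numpy as np

def clip_similarity(emb_a, emb_b):
    """Cosine similarity between two state embeddings (stand-ins for CLIP features)."""
    a, b = np.asarray(emb_a, float), np.asarray(emb_b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def states_match(emb_a, emb_b, theta=0.8):
    """Two UI states are treated as semantically equivalent above threshold theta."""
    return clip_similarity(emb_a, emb_b) > theta

print(states_match([1.0, 0.0], [1.0, 0.0]))   # True
print(states_match([1.0, 0.0], [0.0, 1.0]))   # False
```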

🔗
Composing Skills
Q(s, z) = r(s, z) + γ E_{s'}[ (1 − β_z(s')) Q(s', z) + β_z(s') max_{z'} Q(s', z') ]
Option-value recursion

β_z(s') is the termination function: the probability of ending skill z in state s'. The agent learns from data when to stop one skill and start another.
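The execution loop this recursion describes can be sketched on a toy example; deterministic termination (β ≥ 0.5) stands in for sampling Bernoulli(β_z(s)), and all names here are illustrative.

```python
def execute_option(state, policy, beta, step, max_steps=10):
    """Run one skill until its termination function fires (options-framework sketch)."""
    trace = [state]
    for _ in range(max_steps):
        state = step(state, policy(state))
        trace.append(state)
        if beta(state) >= 0.5:      # skill decides it is done
            break
    return trace

# A toy "scroll until target visible" skill on a 1-D page position:
trace = execute_option(
    state=0,
    policy=lambda s: +1,                            # always scroll down one unit
    beta=lambda s: 1.0 if s >= 3 else 0.0,          # terminate once target reached
    step=lambda s, a: s + a,
)
print(trace)   # [0, 1, 2, 3]
```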

📐
Keeping Skills Distinct
L_ortho = ‖J_zᵀ J_z − I‖²_F
Jacobian orthogonality regularization

Forces each skill dimension to affect behavior independently. Without this, skills collapse into the same average behavior.
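The penalty itself is a one-liner in NumPy; a sketch (the matrix shapes are illustrative):

```python
import numpy as np

def ortho_penalty(J):
    """‖JᵀJ − I‖²_F: zero iff the columns of J (skill directions) are orthonormal."""
    JtJ = J.T @ J
    return float(np.sum((JtJ - np.eye(J.shape[1])) ** 2))

# An orthonormal Jacobian incurs (numerically) no penalty; correlated columns do.
Q, _ = np.linalg.qr(np.random.randn(32, 16))   # Q has orthonormal columns
print(ortho_penalty(Q) < 1e-12)                # True
print(ortho_penalty(np.ones((4, 2))) > 1.0)    # True -- fully correlated columns
```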

Full Pipeline

From raw trajectories to executable skills

1
Trajectory Segmentation

Compute Δa_t = ‖a_t − a_{t−1}‖. Detect boundaries where Δa_t > θ. Each segment is a candidate skill.
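The boundary rule above can be sketched in a few lines of NumPy; this is a minimal illustration (the threshold is arbitrary), not the paper's implementation.

```python
import numpy as np

def segment_boundaries(actions, theta=1.0):
    """Flag a skill boundary wherever the action-to-action jump exceeds theta.

    `actions` is a (T, d) array of primitive action vectors; returns the indices
    t where ||a_t - a_{t-1}|| > theta, i.e. the detected change-points.
    """
    a = np.asarray(actions, float)
    deltas = np.linalg.norm(np.diff(a, axis=0), axis=1)   # Δa_t for t = 1..T-1
    return [t + 1 for t in np.nonzero(deltas > theta)[0]]

# Two tight action clusters separated by one large jump -> one boundary:
traj = [[0.1, 0.1], [0.12, 0.11], [0.9, 0.9], [0.88, 0.92]]
print(segment_boundaries(traj, theta=0.5))   # [2]
```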

2
Segment Representation

Model each segment as a Gaussian mixture p(a | segment) = Σ_m π_m N(a; μ_m, Σ_m). Compare segments via the Wasserstein distance.
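For single Gaussians the 2-Wasserstein distance has a closed form, which gives a hedged sketch of the segment comparison. The paper compares full mixtures; this single-component version is only illustrative, and SciPy is an assumed dependency.

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2(mu1, S1, mu2, S2):
    """2-Wasserstein distance between N(mu1, S1) and N(mu2, S2), closed form:
    W2² = ||mu1 - mu2||² + Tr(S1 + S2 - 2 (S2^{1/2} S1 S2^{1/2})^{1/2})."""
    S2h = sqrtm(S2)
    cross = sqrtm(S2h @ S1 @ S2h)
    d2 = np.sum((np.asarray(mu1, float) - np.asarray(mu2, float)) ** 2)
    d2 += np.trace(S1 + S2 - 2 * np.real(cross))   # np.real: sqrtm may add tiny imag parts
    return float(np.sqrt(max(d2, 0.0)))

# Identical covariances -> the distance reduces to the mean gap ||Δμ|| = 5:
print(round(gaussian_w2([0.0, 0.0], np.eye(2), [3.0, 4.0], np.eye(2)), 6))   # 5.0
```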

3
Hierarchical Clustering

Agglomerative clustering on the Wasserstein distance matrix. Cutting the dendrogram at a merge-distance threshold yields the number of skills k automatically, without fixing k in advance.
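With a precomputed distance matrix, the clustering step can be sketched with SciPy (an assumed dependency; the linkage method and threshold are illustrative choices):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_segments(dist_matrix, merge_threshold):
    """Agglomerative clustering on a precomputed (Wasserstein-style) distance matrix.

    Cutting the dendrogram at `merge_threshold` determines the number of skill
    clusters implicitly, instead of fixing k in advance.
    """
    condensed = squareform(np.asarray(dist_matrix, float))   # linkage wants condensed form
    Z = linkage(condensed, method="average")
    return fcluster(Z, t=merge_threshold, criterion="distance")

# Two well-separated groups of segments collapse into two skill clusters:
D = np.array([[0.0, 0.1, 5.0, 5.1],
              [0.1, 0.0, 5.2, 5.0],
              [5.0, 5.2, 0.0, 0.1],
              [5.1, 5.0, 0.1, 0.0]])
labels = cluster_segments(D, merge_threshold=1.0)
print(labels[0] == labels[1], labels[2] == labels[3], labels[0] != labels[2])
```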

4
Contrastive Embedding

The InfoNCE loss maps clusters into a continuous latent space z ∈ R^16. A multimodal encoder ingests vision + text + action.
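A NumPy sketch of the InfoNCE objective for one trajectory; in the paper f is a learned multimodal scorer, so the raw score vector here is purely illustrative.

```python
import numpy as np

def info_nce(f_scores, pos_idx, temperature=0.1):
    """InfoNCE loss for one trajectory: f_scores[j] = f(τ, z_j) over candidate skills.

    The score of the matching skill (pos_idx) is pulled up relative to all others;
    the loss is the negative log-softmax probability of the positive.
    """
    logits = np.asarray(f_scores, float) / temperature
    logits -= logits.max()                          # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum())
    return float(-log_softmax[pos_idx])

# A trajectory whose matching skill already scores highest incurs a small loss:
scores = [5.0, 0.1, -1.0]
print(info_nce(scores, pos_idx=0) < info_nce(scores, pos_idx=1))   # True
```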

5
Hierarchical Policy

Macro policy π_macro(z|s) selects skills. Each skill executes until its termination function β_z(s) fires (options framework).

Visualization

Skill Embedding Visualization

t-SNE projection of 5,226 interaction segments into the 12-skill vocabulary.

Cluster Details


Interactive Demo

Skill Composition Playground

Build a skill sequence and see what each model predicts as the next skill.

Palette -- click to select, then click a slot
Pipeline
Replay

Trajectory Replay

Replay of a WebArena map-navigation trajectory with skill segmentation overlay.

trajectory-replay.log
Reference

Citation

@inproceedings{interaskill2026,
  title     = {From Hard-Coded {SKILL.md} to Learned Behaviors: Automatic Skill Discovery for Computer-Using Agents},
  author    = {Anonymous},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2026}
}