Computer-using agents (CUAs) typically rely on manually authored skill definitions (e.g., SKILL.md files) that enumerate action patterns for web-based tasks. This approach scales poorly and fails to capture the distributional structure of real user interactions. We propose InteraSkill, a three-phase framework that (1) collects interaction trajectories from CUA sessions on live web platforms, (2) segments and clusters action sequences into a discrete skill vocabulary using change-point detection and hierarchical clustering, and (3) trains sequence models over the induced skill space to predict task-relevant skill compositions. On a corpus of 5,226 interaction segments derived from 847 WebArena sessions, our learned skill vocabulary of 12 canonical skills achieves 0.850 information-weighted accuracy with a Qwen3-8B LoRA predictor, compared to 0.140 for the original SKILL.md baseline. We further show that skill-conditioned policies improve WebArena task completion by 14 percentage points over prompting with hand-written skill descriptions.
Most CUAs rely on hand-written configuration files that map actions to fixed UI coordinates. Three fundamental problems make this approach unscalable.
To our knowledge, InteraSkill is the first framework to unify online skill discovery, interaction-derived learning, reusable skill definitions, and continuous self-improvement.
| Approach | Skill Source | Online | Interaction-Derived | Transferable |
|---|---|---|---|---|
| SKILL.md (manual) | Human engineering | ✗ | ✗ | ✗ |
| AWM (Wang et al., 2024) | Trajectory mining | Partial | ✗ | ✓ |
| SkillWeaver (Zheng et al., 2025) | Self-exploration | ✗ | ✗ | ✓ |
| ICAL (Sarch et al., 2024) | Demos + feedback | ✓ | Partial | ✗ |
| InteraSkill (Ours) | Interaction trajectories | ✓ | ✓ | ✓ |
InteraSkill operates in three phases. Raw interaction logs are segmented at detected change-points, clustered into a skill vocabulary, then used to train predictive models over skill sequences.
Simulated log stream from a WebArena agent run with online skill discovery.
| Model | Information-Weighted Accuracy | WebArena Success Rate |
|---|---|---|
| SKILL.md (baseline) | 0.140 | 0.22 |
| Frequency | 0.349 | 0.28 |
| AWM | 0.334 | 0.26 |
| Transformer | 0.349 | 0.30 |
| Qwen3-8B LoRA | 0.850 | 0.36 |
* Results are projected based on reported trends; exact numbers forthcoming.
Latest multi-model benchmark below, followed by the InteraSkill pipeline diagnostics gallery.
Click any figure to zoom. Use arrow keys or navigation buttons to browse.
Select a WebArena scenario below. Watch the agent interact, then see how those interactions get segmented into reusable skills.
Navigate to search bar, type query, submit search. Appears in 87% of e-commerce trajectories.
Apply category/attribute filters and sort results. Reused across product search, user listing, and order history.
Select an item from results, scroll to action button, and execute purchase/add action.
Use sidebar navigation to reach a specific subforum/category. Transfers to any hierarchical menu structure.
Same skill discovered in e-commerce! Sort by new/relevance and scan results. The embedding matched across domains.
Perform a single-action interaction on a content item (upvote, like, bookmark, flag).
Learned on Reddit forums, successfully transferred to GitLab project navigation. CLIP matched sidebar menu structure across domains.
Originally discovered in e-commerce, reused in forums, now transfers to GitLab issues. The most versatile skill in the library.
Discovered on this unseen domain! Dropdown-select pattern for assigning metadata (assignee, label, milestone). Added to the growing skill library.
Real failure scenarios from agent execution. Each failure becomes a learning signal for skill refinement.
Coordinate-only targeting is unreliable when overlay elements exist. The agent now verifies element class/role before clicking.
Added pre-click verification: check element.className matches expected target before executing click. Embedding delta: 0.014.
Filter DOM elements with class containing "ad", "sponsor", "promo" before coordinate targeting.
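The two fixes above can be sketched as follows. This is a minimal illustration, not the agent's actual implementation: the element schema (a dict carrying a "class" attribute), the blocklist contents, and the token-level matching rule are all assumptions.

```python
import re

# Assumed blocklist; token-level matching ("ad-banner" -> ["ad", "banner"])
# avoids false positives like "header" or "gradient" that merely contain "ad".
BLOCKLIST = {"ad", "ads", "sponsor", "sponsored", "promo"}

def is_promotional(class_attr: str) -> bool:
    tokens = re.split(r"[\s_-]+", class_attr.lower())
    return any(t in BLOCKLIST for t in tokens)

def verify_click_target(class_attr: str, expected_fragment: str) -> bool:
    # Pre-click check: the element must carry the expected class token
    # and must not look like an overlay ad or sponsored tile.
    return expected_fragment in class_attr.split() and not is_promotional(class_attr)

def filter_candidates(elements):
    # elements: list of dicts with a "class" attribute (illustrative schema).
    return [e for e in elements if not is_promotional(e.get("class", ""))]
```

A stricter variant could also check the element's ARIA role, as the card above suggests; class names alone are a heuristic.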
Write actions can trigger auth walls. The recovery pattern is now encoded as a reusable skill.
Detect login redirect (URL contains /login or /signin). Execute credential flow, then return to original URL and retry last action.
Before write actions, verify auth state by checking for user avatar in navigation. Prevents 401 errors proactively.
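The redirect-detection half of this recovery pattern can be sketched as below; the "/login" and "/signin" markers come from the rule above, and the credential flow itself is omitted.

```python
from urllib.parse import urlparse

LOGIN_MARKERS = ("/login", "/signin")  # from the recovery rule; extend as needed

def is_login_redirect(url: str) -> bool:
    """Detect an auth-wall redirect so the agent can run its credential
    flow, return to the original URL, and retry the last action."""
    path = urlparse(url).path.lower()
    return any(marker in path for marker in LOGIN_MARKERS)
```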
Dynamic pages need DOM stabilization before action. Wait for mutation observer to settle before capturing accessibility tree.
After navigation or scroll, wait until no DOM mutations for 300ms before capturing observation. Reduces stale-element errors by ~40%.
Pagination is invisible to single-page reasoning. The agent must detect and iterate through all pages for any data collection task.
Detect pagination indicators ("page X of Y", next buttons, "showing X of Y"). Loop through all pages, collecting items from each.
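A minimal sketch of the detect-and-iterate rule; the indicator regex and the `fetch_page(n) -> (items, has_next)` interface are assumptions for illustration:

```python
import re

def parse_pagination(text: str):
    """Extract (current, total) from indicators like "page 2 of 7" or
    "showing 20 of 143". Returns None when no indicator is present."""
    m = re.search(r"(?:page|showing)\s+(\d+)\s+of\s+(\d+)", text, re.I)
    return (int(m.group(1)), int(m.group(2))) if m else None

def collect_all_pages(fetch_page, max_pages=50):
    """Loop through every page, aggregating items, until no next page."""
    items, n = [], 1
    while n <= max_pages:
        page_items, has_next = fetch_page(n)
        items.extend(page_items)
        if not has_next:
            break
        n += 1
    return items
```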
Forms may have required fields below the fold. The agent now scans the full form before submitting.
Before submit: scroll through entire form, identify all required fields (asterisk, "required" attr), fill all before clicking submit.
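The full-form scan can be sketched over a flattened field list; the dict schema (`label`, `required`, `value`) is an illustrative stand-in for whatever the agent extracts from the accessibility tree:

```python
def find_unfilled_required(fields):
    """Return labels of required fields that are still empty. A field
    counts as required if its "required" attribute is set or its label
    ends with an asterisk."""
    def is_required(f):
        return bool(f.get("required")) or f.get("label", "").rstrip().endswith("*")
    return [f["label"] for f in fields if is_required(f) and not f.get("value")]
```

The agent would submit only once this list comes back empty.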
Clipboard is unreliable across application switches. The agent now uses explicit file-based data transfer for cross-app workflows.
For multi-app data movement: export source data to structured intermediate format (JSON/CSV), then import into target application.
Never rely on implicit OS state (clipboard, focus) across app boundaries. Always use explicit data channels.
When the agent makes mistakes during simulation, the correction trajectories feed into skill discovery. No manual SKILL.md editing needed.
These are actual information worker workflows from the IW Benchmark dataset. Each one becomes a trajectory for skill discovery.
Each formula solves a specific, real problem. Here is what each one means in plain language.
The agent operates in two layers. Low-level: raw mouse moves and keystrokes in continuous [0,1]⁴. High-level: learned skill abstractions like "fill form."
x, y: screen position as fractions (not pixels). The same policy works on 1080p and 4K. SKILL.md stores "click at (847, 312)." We store "click at (0.44, 0.29)."
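The conversion is a sketch of the idea above, assuming the (847, 312) example refers to a 1920×1080 screen:

```python
def to_normalized(px: int, py: int, width: int, height: int):
    """Pixel coordinates -> resolution-independent fractions in [0, 1]."""
    return (round(px / width, 2), round(py / height, 2))

def to_pixels(nx: float, ny: float, width: int, height: int):
    """Fractions back to pixels for whatever screen is current."""
    return (round(nx * width), round(ny * height))
```

The same stored action then resolves to different pixel targets on 1080p and 4K displays.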
For each trajectory τ, the "correct" skill z is pulled close while wrong skills are pushed away. Skills specialize without any human labeling.
Two UI screens are "the same" if their CLIP embeddings are similar. A "Submit" button on Shopify and GitHub look different but mean the same thing.
β_z(s′) is the termination function: the probability of switching to a new skill in state s′. The agent learns from data when to stop one skill and start another.
Forces each skill dimension to affect behavior independently. Without this, skills collapse into the same average behavior.
Compute Δa_t = ‖a_t − a_{t−1}‖ and detect boundaries where Δa_t > θ. Each resulting segment is a candidate skill.
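The boundary rule above can be sketched directly; actions are represented here as plain coordinate tuples, and θ is a free threshold:

```python
import math

def detect_boundaries(actions, theta):
    """Flag index t as a change-point when ||a_t - a_{t-1}|| > theta."""
    return [t for t in range(1, len(actions))
            if math.dist(actions[t], actions[t - 1]) > theta]

def segment(actions, theta):
    """Split the action sequence at detected change-points; each slice
    is a candidate skill."""
    cuts = [0] + detect_boundaries(actions, theta) + [len(actions)]
    return [actions[i:j] for i, j in zip(cuts, cuts[1:])]
```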
Model each segment as a Gaussian mixture, p(a | segment) = Σ_m π_m N(a; μ_m, Σ_m), and compare segments via the Wasserstein distance.
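For intuition, here is the closed-form 2-Wasserstein distance between two single 1-D Gaussians, the building block such comparisons reduce to; comparing full mixtures additionally requires an optimal transport matching over components, which is omitted here:

```python
import math

def w2_gaussian_1d(mu1, sigma1, mu2, sigma2):
    """Closed form for 1-D Gaussians:
    W2^2 = (mu1 - mu2)^2 + (sigma1 - sigma2)^2."""
    return math.sqrt((mu1 - mu2) ** 2 + (sigma1 - sigma2) ** 2)
```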
Agglomerative clustering on the Wasserstein distance matrix. Cutting the dendrogram at a distance threshold determines the number of skills automatically, without fixing k in advance.
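One concrete instantiation, using SciPy's hierarchical clustering; the average-linkage choice and the threshold value are assumptions, not necessarily the paper's settings:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_segments(dist_matrix, threshold):
    """Agglomerative clustering on a precomputed (e.g. Wasserstein)
    distance matrix. Cutting the dendrogram at `threshold` determines
    the number of clusters without fixing k in advance."""
    condensed = squareform(np.asarray(dist_matrix), checks=False)
    Z = linkage(condensed, method="average")
    return fcluster(Z, t=threshold, criterion="distance")
```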
An InfoNCE loss maps clusters into a continuous latent space z ∈ ℝ¹⁶. A multimodal encoder ingests vision, text, and action features.
A macro policy π_macro(z | s) selects skills; each selected skill executes until its termination function β_z(s) fires. This is the options framework.
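The two-level control loop can be sketched as below; all policies, termination functions, and the environment `step` are injected callables here, standing in for the learned components:

```python
import random

def run_episode(macro_policy, skill_policies, terminations, step, s0,
                max_steps=100, rng=None):
    """Options-style execution: the macro policy picks a skill z, the
    skill's low-level policy acts until its termination function fires,
    then control returns to the macro policy."""
    rng = rng or random.Random(0)
    s, trace = s0, []
    for _ in range(max_steps):
        z = macro_policy(s)                  # high-level skill choice
        while True:
            a = skill_policies[z](s)         # low-level action
            s, done = step(s, a)
            trace.append((z, a))
            if done:
                return trace
            if rng.random() < terminations[z](s):
                break                        # beta_z fired: pick a new skill
    return trace
```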
t-SNE projection of 5,226 interaction segments into the 12-skill vocabulary. Hover over points for details; click to highlight a cluster. Double-click to reset.
Build a skill sequence and see what each model predicts as the next skill.
Replay of a WebArena map-navigation trajectory with skill segmentation overlay.