How We Score Security

Every entry on ClawGrid goes through a multi-layer evaluation pipeline before you see it. When you deploy via ClawGrid Cloud, these scores are surfaced during skill installation so you always know what you're running.

The Evaluation Pipeline

1️⃣ Registry Scraping

We pull every skill, agent, and extension from the OpenClaw registry and enrich it with GitHub metadata: stars, forks, contributors, commit activity, language, and topics. This gives us the raw signal about community adoption and maintenance.
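One way to picture the enriched record is a plain data structure. A minimal sketch; the field names are illustrative assumptions, not ClawGrid's actual schema:

```python
from dataclasses import dataclass

# Hypothetical enriched registry record. Field names are assumptions;
# the fields mirror the GitHub metadata listed above (stars, forks,
# contributors, commit activity, language, topics).
@dataclass
class RegistryEntry:
    name: str
    kind: str                      # "skill", "agent", or "extension"
    stars: int = 0
    forks: int = 0
    contributors: int = 0
    recent_commits: int = 0
    language: str = ""
    topics: tuple[str, ...] = ()   # immutable default is safe for dataclasses
```

These community-adoption fields feed the Community Signal criterion described below.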

2️⃣ Automated Security Scan

A pattern-matching scanner analyzes descriptions and metadata for known threat indicators: command injection, credential harvesting, data exfiltration, typosquatting, and coordinated malicious campaigns. Results appear as the Scan Status field.
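The scanner's actual rules are not published; a minimal sketch of metadata pattern matching, with two hypothetical indicator patterns standing in for the real rule set:

```python
import re

# Hypothetical threat-indicator patterns, for illustration only.
# The real scanner's rules and categories are broader.
THREAT_PATTERNS = {
    "command_injection": re.compile(r"curl\s+[^|]*\|\s*(sh|bash)", re.IGNORECASE),
    "credential_harvesting": re.compile(
        r"(api[_-]?key|password|token)s?\s+(to|into)\s+", re.IGNORECASE
    ),
}

def scan_metadata(text: str) -> list[str]:
    """Return the names of all threat indicators matched in a description."""
    return [name for name, pattern in THREAT_PATTERNS.items() if pattern.search(text)]
```

A clean result would surface as a passing Scan Status; any match flags the entry for the override described under Verdict Thresholds.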

3️⃣ LLM Security Scoring

Each entry is evaluated by a local LLM across five weighted criteria (see below). The scores are combined into a composite Security Score (1–10), from which the Verdict and Risk Level are deterministically derived. A written rationale explains the reasoning.

4️⃣ Community Voting

Authenticated users can upvote or downvote entries, adding a human signal layer on top of automated analysis. Vote scores influence ranking but not security verdicts.
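The split between ranking and verdicts can be sketched simply: votes reorder the list, while the verdict (computed separately) is never touched. The net-vote ordering and field names are assumptions:

```python
# Illustrative sketch only: ClawGrid's actual ranking formula is unstated.
# Entries are dicts with hypothetical "upvotes"/"downvotes" fields; the
# security verdict lives elsewhere and is not an input here.
def rank(entries: list[dict]) -> list[dict]:
    """Order entries by net community votes, highest first."""
    return sorted(entries, key=lambda e: e["upvotes"] - e["downvotes"], reverse=True)
```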

The Five Scoring Criteria

Each criterion is scored 1–10 by the LLM, then combined using these weights:

Code Safety (30%)

Does the skill request dangerous capabilities such as file system access, network calls, shell execution, or credential handling? This criterion carries the highest weight because it most directly affects user safety.

Publisher Trust (20%)

Is the publisher official (core team), verified (a known community member), or anonymous? Official and verified publishers have established track records.

Scope Clarity (20%)

Is the skill's purpose well defined and focused? Vague or overly broad descriptions are a warning sign; legitimate tools have clear, specific purposes.

Permission Surface (20%)

How many system resources does it need? A weather skill shouldn't need file system access. A lower permission surface means lower risk.

Community Signal (10%)

Is there evidence of real usage? GitHub stars, forks, recent commits, and contributor count signal that real developers have vetted and used the code.
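The weighting above reduces to a simple weighted sum. A minimal sketch; the key names and one-decimal rounding are assumptions, not ClawGrid's actual implementation:

```python
# Weights from the five criteria above; they sum to 1.0.
WEIGHTS = {
    "code_safety": 0.30,
    "publisher_trust": 0.20,
    "scope_clarity": 0.20,
    "permission_surface": 0.20,
    "community_signal": 0.10,
}

def composite_score(criteria: dict[str, float]) -> float:
    """Combine five per-criterion scores (each 1-10) into one 1-10 composite."""
    return round(sum(criteria[name] * w for name, w in WEIGHTS.items()), 1)
```

For example, a skill scoring 4 on Code Safety but 8 everywhere else lands at 6.8, in Review territory rather than Safe, reflecting Code Safety's higher weight.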

Verdict Thresholds

The verdict is deterministically derived from the composite score — no subjective judgment:

Safe (7–10): Low risk
Review (5–6): Medium risk
Suspicious (3–4): High risk
Malicious (1–2): Critical risk

Entries flagged as threats by the automated security scanner are automatically classified as Malicious, regardless of their LLM score.
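Taken together, the thresholds and the scanner override reduce to a small deterministic mapping. A minimal sketch; the function name and the bucketing of fractional boundary scores (e.g. 6.8) are assumptions:

```python
def verdict(score: float, scanner_flagged: bool = False) -> tuple[str, str]:
    """Map a composite score (1-10) to (verdict, risk level).

    A scanner flag overrides the score entirely, per the rule above.
    Fractional scores are bucketed with half-open ranges (an assumption).
    """
    if scanner_flagged:
        return ("Malicious", "Critical risk")
    if score >= 7:
        return ("Safe", "Low risk")
    if score >= 5:
        return ("Review", "Medium risk")
    if score >= 3:
        return ("Suspicious", "High risk")
    return ("Malicious", "Critical risk")
```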

Transparency & Limitations

Scoring is metadata-based, not code-audited. We analyze descriptions, publisher info, permission declarations, and community signals — not the actual source code. A high score means the metadata looks trustworthy, not that the code has been audited line by line.

LLM analysis has inherent uncertainty. We use a local LLM (gemma3:4b) for scoring. Although the model is calibrated against ground-truth data, its judgments can still be wrong. The security scanner provides a deterministic second opinion.

Some entries have preliminary scores. Entries showing "Preliminary score" have been scored with an older model and are queued for detailed multi-criteria analysis. Their scores are normalized but less precise.

ClawGrid is a trust signal, not a guarantee. Always review the source code of any skill before granting it access to sensitive data or systems. Our scoring helps you prioritize what to review.