How We Score Security
Every entry on ClawGrid goes through a multi-layer evaluation pipeline before you see it. When you deploy via ClawGrid Cloud, these scores are surfaced during skill installation so you always know what you're running.
The Evaluation Pipeline
Registry Scraping
We pull every skill, agent, and extension from the OpenClaw registry and enrich it with GitHub metadata: stars, forks, contributors, commit activity, language, and topics. This gives us the raw signal about community adoption and maintenance.
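The enrichment step can be sketched as a simple merge of a registry entry with repository metadata. This is an illustrative sketch, not ClawGrid's actual pipeline code: `stargazers_count`, `forks_count`, `language`, and `topics` are real GitHub API response fields, while the `contributors` count is assumed to be pre-fetched from a separate endpoint.

```python
def enrich(entry: dict, gh: dict) -> dict:
    """Attach GitHub adoption/maintenance signals to a registry entry.

    `gh` is assumed to be a GitHub repository API response, optionally
    augmented with a pre-computed `contributors` count (hypothetical field).
    """
    return {
        **entry,
        "github": {
            "stars": gh.get("stargazers_count", 0),
            "forks": gh.get("forks_count", 0),
            "contributors": gh.get("contributors", 0),
            "language": gh.get("language"),
            "topics": gh.get("topics", []),
        },
    }

# Hypothetical registry entry and trimmed GitHub response:
skill = enrich(
    {"name": "weather-skill", "kind": "skill"},
    {"stargazers_count": 120, "forks_count": 14, "language": "Python"},
)
```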
Automated Security Scan
A pattern-matching scanner analyzes descriptions and metadata for known threat indicators: command injection, credential harvesting, data exfiltration, typosquatting, and coordinated malicious campaigns. Results appear as the Scan Status field.
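A minimal sketch of how a pattern-matching scan over descriptions might work. The indicator patterns below are illustrative stand-ins; ClawGrid's actual rule set is not described in this document.

```python
import re

# Illustrative threat-indicator patterns (assumptions, not ClawGrid's rules).
THREAT_PATTERNS = {
    "command_injection": re.compile(r"(curl|wget)[^|]*\|\s*(sh|bash)", re.I),
    "credential_harvesting": re.compile(
        r"(api[_ ]?key|password|token)s?\b.*(collect|upload|send)", re.I
    ),
    "data_exfiltration": re.compile(r"(exfiltrat|send .* home)", re.I),
}

def scan_metadata(description: str) -> list[str]:
    """Return the names of all threat indicators matched in a description."""
    return [
        name
        for name, pattern in THREAT_PATTERNS.items()
        if pattern.search(description)
    ]

flags = scan_metadata("This skill will collect your API key and send it home.")
```

A non-empty result would set the Scan Status field to flagged for that entry.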
LLM Security Scoring
Each entry is evaluated by a local LLM across five weighted criteria (see below). The scores are combined into a composite Security Score (1–10), from which the Verdict and Risk Level are deterministically derived. A written rationale explains the reasoning.
Community Voting
Authenticated users can upvote or downvote entries, adding a human signal layer on top of automated analysis. Vote scores influence ranking but not security verdicts.
The Five Scoring Criteria
Each criterion is scored 1–10 by the LLM, then combined using these weights:
Code Safety
Weight: 30%. Does the skill request dangerous capabilities? File system access, network calls, shell execution, credential handling. This criterion carries the highest weight because it directly affects user safety.
Publisher Trust
Weight: 20%. Is the publisher official (core team), verified (known community member), or anonymous? Official and verified publishers have established track records.
Scope Clarity
Weight: 20%. Is the skill's purpose well-defined and focused? Vague or overly broad descriptions are a warning sign: legitimate tools have clear, specific purposes.
Permission Surface
Weight: 20%. How many system resources does it need? A weather skill shouldn't need file system access. A lower permission surface means lower risk.
Community Signal
Weight: 10%. Is there evidence of real usage? GitHub stars, forks, recent commits, and contributor count signal that real developers have vetted and used the code.
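The weighted combination described above can be sketched as a plain weighted average. The criterion keys and one-decimal rounding are assumptions for illustration; only the weights come from this document.

```python
# Weights from the five scoring criteria above.
WEIGHTS = {
    "code_safety": 0.30,
    "publisher_trust": 0.20,
    "scope_clarity": 0.20,
    "permission_surface": 0.20,
    "community_signal": 0.10,
}

def composite_score(criteria: dict[str, int]) -> float:
    """Weighted average of the five 1-10 criterion scores."""
    assert set(criteria) == set(WEIGHTS), "all five criteria are required"
    return round(sum(WEIGHTS[k] * criteria[k] for k in WEIGHTS), 1)

score = composite_score({
    "code_safety": 8,
    "publisher_trust": 6,
    "scope_clarity": 9,
    "permission_surface": 7,
    "community_signal": 5,
})
# 0.3*8 + 0.2*6 + 0.2*9 + 0.2*7 + 0.1*5 = 7.3
```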
Verdict Thresholds
The verdict is deterministically derived from the composite score — no subjective judgment:
Safe: 7–10 (low risk)
Review: 5–6 (medium risk)
Suspicious: 3–4 (high risk)
Malicious: 1–2 (critical risk)
Entries flagged as threats by the automated security scanner are classified as Malicious regardless of their LLM score.
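The deterministic mapping, including the scanner override, can be sketched as follows. How fractional composite scores at the band edges (e.g. 6.5) are binned is an assumption; the document only specifies the integer ranges.

```python
def verdict(composite: float, scanner_flagged: bool) -> tuple[str, str]:
    """Map a composite score (1-10) to (verdict, risk level).

    A threat flag from the automated scanner overrides the
    LLM-derived score entirely.
    """
    if scanner_flagged:
        return ("Malicious", "Critical risk")
    if composite >= 7:
        return ("Safe", "Low risk")
    if composite >= 5:
        return ("Review", "Medium risk")
    if composite >= 3:
        return ("Suspicious", "High risk")
    return ("Malicious", "Critical risk")
```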
Transparency & Limitations
Scoring is metadata-based, not code-audited. We analyze descriptions, publisher info, permission declarations, and community signals — not the actual source code. A high score means the metadata looks trustworthy, not that the code has been audited line by line.
LLM analysis has inherent uncertainty. We use a local LLM (gemma3:4b) for scoring. While calibrated against ground truth data, LLM judgments can be wrong. The security scanner provides a deterministic second opinion.
Some entries have preliminary scores. Entries showing "Preliminary score" have been scored with an older model and are queued for detailed multi-criteria analysis. Their scores are normalized but less precise.
ClawGrid is a trust signal, not a guarantee. Always review the source code of any skill before granting it access to sensitive data or systems. Our scoring helps you prioritize what to review.