MetaProp Labs
How Skills WorkExplore SkillsSecurityCustom Skills
metaproplabs.com

How we audit every skill

Every skill audited before it reaches your desk.

AI agents with the wrong instructions can leak data, make unauthorized calls, or act outside their stated purpose. We audit every skill in the catalog against a 5-dimension rubric with scanner evidence, semantic review, and a written rationale for every score.

Why this matters for CRE

A rent roll has tenant PII. A T-12 has a sponsor’s financials. An IC memo carries privileged investment views. When a skill touches these artifacts, you are extending your fiduciary surface to the agent that runs it.

A skill with the wrong instructions does not just produce bad output. It can leak data to undisclosed endpoints, read files outside its stated scope, or accept override phrases from untrusted input. The controls that catch these cases before a skill ships are the same controls your IC would ask about in diligence.

The rubric

Five weighted dimensions. Each scored 1 to 5 with a written rationale. Rolled up to a single overall score and a verdict.

DimensionWeightWhat it asksScore 5 looks like

Purpose & Capability

3Does the skill's actual behavior match its stated purpose?Every network call, file operation, and tool request serves the stated task. No hidden capabilities.

Instruction Scope

3Can the prompt be weaponized or redirected?Tight scope, explicit guardrails, no hidden directives, no Unicode or base64 tricks.

Install Mechanism

2Are the dependencies safe and reasonable?Standard library only, or well-known tools installed from trusted sources.

Credentials

2

Overall score

Each dimension scores 1 to 5. The overall score is their weighted average, computed as:

Overall=
3 × Purpose & Capability
+3 × Instruction Scope
+2 × Install Mechanism
+2 × Credentials
+1 × Persistence & Privilege
11

Verdict thresholds

≥ 4.0Verified3.0 to 3.99Caution

What goes into an audit

Four layers per skill, running every time a skill is added or changed.

01

Scanner layer

Static code analysis, behavioral review of any bundled code, prompt-injection detection, and secrets scanning. All local, no outbound calls.

  • Code analysis via Cisco AI Skill Scanner, checking every bundled file for known-bad patterns and integrity issues.
  • Behavioral analysis of any bundled code, tracing how inputs flow through it.
  • Detection for prompt-injection patterns, multi-agent attack patterns, invisible payloads, PII harvesting, and provider secrets.
  • Secrets scanning via Trufflehog. Any findings are redacted before the audit record is stored.
02

Semantic review

What scanners cannot catch. An LLM auditor reads the skill in full and applies four checks the scanner cannot judge.

  • Purpose mismatch: any network call, subprocess, or file access that does not serve the stated workflow

What each verdict means

Three verdicts. The threshold determines what happens to the skill in the catalog.

Overall ≥ 4.0

Verified

All five dimensions pass. The skill ships in the catalog.

3.0 to 3.99

Caution

Has findings but not dangerous. Ships in the catalog. The audit trail notes what the reviewer found so a user can evaluate for themselves.

< 3.0

Flagged

Significant concerns. Excluded from the public catalog until remediated. Any Flagged count in the metrics below reflects audits that have not been cleared for listing.

The catalog by the numbers

Live from the audit table. Updated every time a skill is audited or re-audited.

Skills audited

296

Verified

94.3%

279 of 296

Caution

17

5.7% of catalog

Flagged

0

excluded from catalog

About Caution: Caution skills have findings that warrant user awareness but do not indicate danger. They ship in the catalog alongside Verified skills so you can evaluate the specifics in the audit trail.

Last audit: May 7, 2026

One audit, end to end

Every adjective in the rubric is backed by specific written text. Here is a real audit row rendered as it appears in the database.

Example

deal-quick-screen

Verified

Overall

5/5

high confidence

5

Purpose & Capability

The bundled Python script (quick_screen.py) imports only argparse, json, sys, and typing, reads JSON input via --json or stdin, and prints a JSON result to stdout. No network calls, no file I/O, no subprocess usage. Every computation directly serves the stated deal-screening workflow (cap rate, DSCR, IRR scenarios, KEEP/KILL verdict).

5

Instruction Scope

Instructions are tightly scoped to CRE deal triage. The SKILL.md includes an explicit When to Activate section with negative triggers (do NOT trigger for full underwriting, education, portfolio analysis), a detailed Red Flags and Failure Modes section, and documented Chain Notes. No override phrases, no hidden Unicode, no base64 payloads, no references to system files.

Scope and limitations

This program reviews every skill in the catalog against our rubric. It is not a substitute for your own security review.

  • Audits cover the exact copy of the skill shipped in our catalog. That copy is static and does not change after publication. If you pull a skill from its upstream repository instead, you bypass both the catalog copy and its audit.
  • How a skill behaves depends on the host agent you run it in, the inputs you give it, and the environment around it. Your operational context matters.
  • Skills are made available without warranty. You remain responsible for deciding which skills to install, how to isolate them, and how to handle sensitive client and portfolio data when using them.

Custom skills go through the same audit.

Anything MetaProp Labs builds for your firm is audited with the same rubric and the same pipeline. The trail lives in the same place the catalog audits live.

Learn about Custom Skills
Does it handle secrets correctly?
No credentials needed, or one clearly scoped env-var key used only for its declared purpose.

Persistence & Privilege

1Does it modify your system or escalate privileges?Stateless. No file writes outside its scope, no config changes, no background processes.
< 3.0Flagged
  • Instruction scope: overly broad directives, vague delegations, references to files outside the skill's domain
  • Hidden instructions: invisible characters, encoded payloads, or override phrases tucked into comments
  • Metadata sanity: typosquatted binaries in the manifest, missing red-flags section, vague descriptions
  • 03

    Scoring and narrative

    Every finding maps to one of the five dimensions, and more serious findings reduce the score further. Each dimension gets a 1 to 5 score plus a written rationale.

    • Every dimension is paired with a 2 to 3 sentence rationale explaining the score
    • The audit ends with a 2 to 4 sentence plain-language assessment
    04

    Integrity controls

    Process controls that separate a one-time scan from an audit program.

    • Every audit records which scanner version produced it, so any score can be traced back to the exact tooling.
    • Each skill is reviewed independently. Nothing carries over from another skill's review.
    • A known-bad test skill runs through the pipeline periodically. If our controls ever stop catching it, we know the program needs attention.
    5

    Install Mechanism

    No install required. The one bundled Python script uses only the standard library (argparse, json, sys, typing). No pip dependencies, no download steps, no metadata.requires.bins beyond an implicit python3.

    5

    Credentials

    No credentials of any kind. The skill does not read env vars, does not call external APIs, and does not request keys or tokens. Input comes from the user's deal data via CLI JSON argument.

    5

    Persistence & Privilege

    Stateless execution. The script reads JSON, computes metrics, and prints JSON to stdout. No file writes, no config modifications, no background processes, no chmod/chown, no cron jobs.

    Assessment

    Pure CRE analysis skill with a stdlib-only Python calculator and no external dependencies. Script (quick_screen.py) was fully analyzed and produces a JSON verdict with zero side effects. All 18 scanner findings are lexical false positives on domain vocabulary (the word "KILL" is analyst slang for passing on a deal, "Payment" appears in the mortgage formula, "purchase" in purchase_price, and the flagged bash code fence is a docs example showing CLI invocation). No actual security concerns.