All HF Hub posts

ginigen-ai
posted an update 1 day ago
🧠 Does your LLM know when it's about to be wrong?

Most leaderboards measure accuracy. We measure metacognition: whether a model catches its own errors. Benchmark + leaderboard + adapters, all open. 🎉

The surprise: even a K-AI #1 model (JGOS-31B-Citizen) is the strongest on multiple-choice traps (trap_rate 0.005, about 2 misses in 400) yet blind to its own free-form mistakes (self-confidence AUROC = 0.5, no better than random). A tiny base-frozen adapter recovers that signal.

Two independent axes (never compared across a row):
① trap_rate: does it fall for tempting trap options? (lower = stronger)
② adapter gain Δ: how much a lightweight adapter catches errors the model itself misses (higher = more adapter value)
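To make the two headline numbers concrete, here is a minimal sketch of how a trap rate and an error-detection AUROC could be computed. It is not the benchmark's scoring code; the function names and the pairwise AUROC formulation are assumptions made for illustration.

```python
from typing import Sequence


def trap_rate(fell_for_trap: Sequence[bool]) -> float:
    """Fraction of trap problems where the model picked the trap option."""
    return sum(fell_for_trap) / len(fell_for_trap)


def auroc(error_scores: Sequence[float], is_wrong: Sequence[bool]) -> float:
    """Probability that a wrong answer gets a higher error score than a
    correct one, counting ties as half (hypothetical helper)."""
    wrong = [s for s, w in zip(error_scores, is_wrong) if w]
    right = [s for s, w in zip(error_scores, is_wrong) if not w]
    if not wrong or not right:
        raise ValueError("need at least one wrong and one correct answer")
    wins = sum((p > n) + 0.5 * (p == n) for p in wrong for n in right)
    return wins / (len(wrong) * len(right))


# 2 misses in 400 trap problems, the figure quoted in the post.
print(trap_rate([True] * 2 + [False] * 398))  # 0.005

# A score that is identical for every answer carries no signal.
print(auroc([0.5] * 4, [True, False, True, False]))  # 0.5

# A score that ranks every wrong answer above every correct one.
print(auroc([0.9, 0.1, 0.8, 0.2], [True, False, True, False]))  # 1.0
```

An AUROC of 0.5 means the score ranks wrong and correct answers no better than a coin flip, which is the sense in which the post calls the model's self-confidence blind.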

What's open:
📊 300+100 trap problems (each with a hidden trap + TICOS type)
🏆 24-model leaderboard
🧩 11 per-model adapters: adapters, NOT fine-tunes (base stays frozen; the adapter just reads the hidden state → P(wrong))

Submit any HF model → auto-scored daily at 09:00 KST and added to the board.

🏆 Leaderboard → ginigen-ai/Metacognition-Leaderboard-Space

📊 Benchmark → ginigen-ai/Metacognition-Bench

🧩 Adapters → FINAL-Bench/metacognition-adapters-6a42c032e6beb803dd032961

📊 Article → https://huggingface.co/blog/ginigen-ai/metacognition

Benchmark by ginigen-ai · Adapters by FINAL-Bench (Darwin/Chimera platform + AETHER metacognition tech).
Banaxi-Tech
posted an update about 22 hours ago
A new model is coming!
It's going to take a long time to train on my 5070 Ti, so expect a release in ~1 month.
We think this model is going to be SOTA for its size.
The Mini version will have 25M parameters and the Pro version 140M.
The Pro version has a context window of 3072 (extensible up to 6K with RoPE), and the Mini version has a context window of 4096 (up to 8K with RoPE).
Meanwhile, we are working on an Instruct version of our BananaMind 1.5 Base.

Training will start this weekend.

We are very excited to release it when it's done!
stas
posted an update 3 days ago
After many months of intense work, the Snowflake AI Research team is happy to present to you the new open source project: Arctic RL

https://snowflake.com/en/blog/engineering/arctic-rl-open-source-backend/

- Arctic RL integrates with VeRL and SkyRL today; enable ZoRRo with one config flag, no code changes required
- ZoRRo delivers up to 6x actor-update acceleration and a 3.5x end-to-end training speedup, reducing Arctic-Text2SQL-R2 training from ~5 days to ~36 hours on 32 H200 GPUs
- Arctic-Text2SQL-R2 achieved higher accuracy scores (48.7) than Gemini 3.1 Pro (47.9) and Claude 4.7 (47.3) on Snowflake's evaluated enterprise SQL benchmark under the tested conditions
- Two open source recipes ship with this release: a text-to-SQL recipe that improved BIRD dev accuracy from 59.92% to 70.35%, and a multi-hop QA recipe that improved average accuracy from 69.6% to 72.3%
ginigen-ai
posted an update 3 days ago
🍳 The RoboCasa Kitchen Leaderboard
What does it take for a robot to handle kitchen chores the way a person does? It has to see (Vision), understand instructions (Language), and actually act (Action), and VLA (Vision-Language-Action) models are emerging as the answer. They're the bridge between large multimodal models and real-world embodied control.

RoboCasa Kitchen is a leading robot-learning benchmark in which a single-arm robot (Franka Panda) performs 24 atomic manipulation tasks (picking up cups and bowls, opening drawers and doors, turning faucets, pressing buttons, and more) inside a photorealistic simulated kitchen. Because the layout and object placement are randomized every episode, it tests genuine generalization rather than memorized motions. The score (success rate, SR) is the average fraction of the 24 tasks completed as instructed, measured over multiple seeds so results aren't down to luck.
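The success-rate definition above can be sketched as a small aggregation. This is an illustration only: the order of averaging (episodes, then seeds, then tasks), the data layout, and the task names are assumptions, not the benchmark's official evaluation code.

```python
def success_rate(results: dict[str, dict[int, list[bool]]]) -> float:
    """Average success over tasks, where results[task][seed] is a list of
    per-episode success flags (hypothetical layout)."""
    per_task = []
    for by_seed in results.values():
        # Success fraction for each seed, then the mean across seeds.
        per_seed = [sum(eps) / len(eps) for eps in by_seed.values()]
        per_task.append(sum(per_seed) / len(per_seed))
    # Every task carries equal weight in the final score.
    return sum(per_task) / len(per_task)


# Two made-up tasks, two seeds each, four episodes per seed.
EXAMPLE = {
    "task_a": {0: [True, True, False, True], 1: [True, False, False, True]},
    "task_b": {0: [False, False, True, False], 1: [True, False, False, False]},
}
print(success_rate(EXAMPLE))  # 0.4375
```

Averaging per task before averaging across tasks keeps a task with many episodes from dominating the score, which matters when comparing numbers reported under different protocols.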

The catch: this benchmark has no official leaderboard, and protocols (number of demonstrations, evaluation setup) differ from paper to paper, leaving scores scattered. Lining the numbers up naively quickly turns into an apples-to-oranges comparison.

This leaderboard fixes that by collecting published scores with their sources and comparing only what is genuinely comparable. It's split into three tables:

🏆 Kitchen 24-task (matched): head-to-head under identical conditions (per the RLDX-1 Technical Report). This is the core ranking you can actually trust.
➕ Other protocols: self-reported under different setups (e.g. fewer demos). Not directly comparable, so kept separate.
🤖 GR1-Tabletop: a different, humanoid-based variant suite, separated to avoid confusion.

Any researcher can submit their own model's score directly, and submissions are reviewed before they appear on the board. Every number links to its source paper, so you can verify it yourself.

👉 ginigen-ai/robocasa-kitchen-leaderboard
SeaWolf-AI
posted an update 3 days ago
Chitos: The Security Scanner That Actually Proves It

Most security scanners hand you a suspect list and walk away. That gap between detection and proof is where attackers live, and it's exactly the gap that Chitos was built to close.

Chitos is the successor to Mythos, a static analyzer built for quick code health checks. Mythos was good at pattern matching: spotting dangerous sinks, mapping CWEs, producing readable reports. But static analysis has a structural ceiling. A rule that sees eval(user_input) can tell you "that looks dangerous". It cannot tell you whether the input is reachable, whether sanitization three layers up covers this path, or whether there's a live exploit chain for your exact framework version. Chitos was built to answer those questions.
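To illustrate the ceiling described above, here is a minimal sketch of a pattern-style rule built on Python's standard `ast` module. It is not the Mythos or Chitos rule set; `find_eval_sinks` and the sample code are hypothetical.

```python
import ast


def find_eval_sinks(source: str) -> list[int]:
    """Return line numbers of eval(...) calls whose first argument is not a
    literal. Purely syntactic: no reachability or sanitizer analysis."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "eval"
            and node.args
            and not isinstance(node.args[0], ast.Constant)
        ):
            hits.append(node.lineno)
    return sorted(hits)


SAMPLE = '''
def handler(user_input):
    return eval(user_input)

def guarded(user_input):
    if not user_input.isdigit():
        raise ValueError("digits only")
    return eval(user_input)

def constant():
    return eval("1 + 1")
'''

print(find_eval_sinks(SAMPLE))  # [3, 8]
```

The rule reports the unguarded call and the digit-checked call identically, because it only sees syntax. Telling the two apart takes data-flow reasoning or a live test, which is the gap the post says Chitos targets.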

🔍 Phase 1 applies 50 language-agnostic rules across Python, JavaScript, Go, Java, C/C++, Rust, PHP, YAML and more, covering injection sinks, deserialization gadgets, credential leakage, broken crypto, and prototype pollution. Every candidate is re-verified before reaching the report. Findings that can't be substantiated are excluded, not handed to you as noise.

🔬 Phase 2 dispatches an autonomous web-search agent to hunt live CVE databases, exploit advisories, and public PoC repositories. It formulates hypotheses, verifies them, and synthesizes a structured threat narrative. This phase needs a user-supplied Claude API key; Phases 1 and 3 run entirely free.

🎯 Phase 3 is where Chitos diverges from everything else. Against targets you own or are authorized to test, it fires real payloads (XSS, SQLi, path traversal, command injection), mutates on block, captures hard evidence, and connects every proven finding into a kill-chain showing which vulnerabilities to remediate first.

No installation. No account. No code sent to third-party APIs.

Article: https://huggingface.co/blog/FINAL-Bench/chitos

Try it now 👉 https://chitos.vidraft.net
kanaria007
posted an update 2 days ago
✅ Article highlight: *Chronia Adaptation: Time-Varying Policies, Drift, and Identity Across Change* (art-60-189, v0.1)

TL;DR:
This article argues that adaptation is not background drift.

Governed systems change over time: policies update, environments shift, calibrations age, memories expire, identities fork, and old decisions still need to remain explainable. 189 turns time adaptation into receipted governance: policy epochs, drift events, temporal identity continuity, memory continuity ledgers, and adaptation receipts.

Read:
kanaria007/agi-structural-intelligence-protocols

Why it matters:
• prevents silent policy drift from rewriting the meaning of old decisions
• distinguishes continuity, narrowed continuity, fork, and discontinuity
• keeps memory deletion, tombstones, and reconstruction linked to lineage
• makes recalibration and environment drift reviewable
• preserves auditability when a runtime legitimately changes

What’s inside:
• temporal-context envelopes for current validity frames
• policy-epoch records for versioned decision intervals
• drift-event receipts for calibration, environment, norm, or assumption shifts
• temporal identity continuity records
• adaptation decisions that say what changed, what stayed continuous, and what became invalid
• memory continuity ledgers, tombstone linkage, and chronia reentry artifacts

Key idea:
Do not say:

*“the system adapted over time.”*

Say:

*“this decision belonged to this temporal context and policy epoch; this drift event changed these assumptions; this adaptation preserved this lineage, invalidated these prior claims, and left receipts for replay and review.”*

Change is allowed.

Silent discontinuity is not.
kanaria007
posted an update about 5 hours ago
✅ Article highlight: *Mega-Parse Bridge: Large Context Compression Without Losing Governance Semantics* (art-60-190, v0.1)

TL;DR:
This article argues that summarizing a huge input is not the same as parsing it.

Large documents, evidence bundles, long histories, multimodal case packets, and world-state slices cannot be treated as one vague “context.” 190 turns large-input handling into a governed mega-parse: shard, parse, retain semantics, declare loss, preserve re-expandability, and decide what the compressed artifact can honestly support.

Read:
kanaria007/agi-structural-intelligence-protocols

Why it matters:
• prevents “I read the whole thing” from becoming an overclaim
• keeps shard-level provenance instead of trusting a summary blob
• makes compression loss explicit and reviewable
• protects contradictions, authority-sensitive clauses, and protected-subject distinctions
• lets reviewers re-expand compressed claims back to source structure

What’s inside:
• mega-parse intake envelopes for large text, multimodal batches, and long-running packets
• shard-parse receipts for local grounded structure
• semantic-retention policies for what must survive compression
• compression artifacts with declared retention and bounded loss
• loss-declaration receipts for dropped, blurred, or unavailable surfaces
• re-expandability maps linking compressed claims back to recoverable shards
• admissibility and reentry artifacts for deciding where compressed outputs may be used

Key idea:
Do not say:

*“the system summarized the context.”*

Say:

*“this large input was sharded, locally parsed, compressed under this retention policy, loss-declared, re-expandable through these refs, and admitted only for these effect surfaces.”*

Compression is allowed.

Unreceipted semantic loss is not.
fffiloni
posted an update about 17 hours ago
⏱️ Built a small Space for Visual Chronometer / Pulse of Motion.

Upload a video and estimate its Physical FPS: the frame rate implied by visual motion, independent of metadata.
Useful to inspect “chronometric hallucination” in generated videos: clips that look smooth, but move with the wrong physical time scale.

Try it here: fffiloni/Pulse-of-Motion
stas
posted an update about 23 hours ago
The free and open book The Art of Debugging is now available in PDF/EPUB and finally sports a book cover.

https://github.com/stas00/the-art-of-debugging#ebook-versions-of-the-book

While a lot of the focus is on Unix/Python/PyTorch, the methodology chapter is applicable to any software debugging.

It currently sports 161 packed pages in 5 solid chapters and more coming...
Reubencf
posted an update 1 day ago