Mark Erdmann's Highlights on 'Rohan Paul on X: "Transformer models can learn robust reasoning skills (beyond those of GPT-4-Turbo and Gemini-1.5-Pro) through a stage of training dynamics that continues far beyond the point of overfitting (i.e. with 'Grokking') 🤯 For a challenging reasoning task with a large search space,… https://t.co/Tl9bND5PHq" / X'

clem 🤗 on X: "Pumped to announce the brand new open LLM leaderboard. We burned 300 H100 to re-run new evaluations like MMLU-pro for all major open LLMs! Some learning: - Qwen 72B is the king and Chinese open models are dominating overall - Previous evaluations have become too easy for recent" / X

x.com/ClementDelangue/status/1805989925080219927

Jun 26, 2024

Anna Mills, annamillsoer.bsky.social, she/her on X: "Can we tell when a student submission is AI? This study from the University of Reading suggests not. "The university’s markers – who were not told about the project – flagged only one of the 33 entries." https://t.co/dTuSCTJnHF" / X

x.com/AnnaRMills/status/1806033830027182423

Jun 26, 2024

Ethan Mollick on X: "Researchers secretly added AI-created papers to the exam pool: “We found that 94% of our AI submissions were undetected. The grades awarded to our AI submissions were on average half a grade boundary higher than that achieved by real students.“ https://t.co/z8IX14133B. https://t.co/JDmET3q7pw" / X

x.com/emollick/status/1806040241104470228

Jun 26, 2024

https://arxiv.org/abs/2406.13121 - Search / X

x.com/search?q=https%3A%2F%2Farxiv.org%2Fabs%2F2406.13121&src=typed_query&f=top

Jun 26, 2024

Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?

arxiv.org/abs/2406.13121

Jun 26, 2024

Posts liked by mark erdmann (@markerdmann) / X

x.com/markerdmann/likes

Jun 25, 2024

Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs

arxiv.org/abs/2406.11695

Jun 25, 2024

Dan Hendrycks on X: "Nat's right so I think I'm going to make 2-3 more benchmarks to replace MMLU and MATH." / X

x.com/DanHendrycks/status/1804929811703591345

Jun 24, 2024

Eugene Yan (SF 22 - 28 June) on X: "i previously spoke to a team who only used embedding-based retrieval. i suggested, insisted, they try lexical search. at our next chat, they shared that 80% of the relevant docs now come from lexical search. i.e., without lexical search they were missing 80% of the juice for RAG. https://t.co/2N92Xygw1G" / X

x.com/eugeneyan/status/1804270554033328359

Jun 21, 2024

Detecting hallucinations in large language models using semantic entropy - Nature

www.nature.com/articles/s41586-024-07421-0

Jun 21, 2024

Andrej Karpathy on X: "The way to think about asking a factual question to an LLM is that it's a bit like asking a person who read about the topic previously, but they are not allowed to reference any material and have to answer just from memory. LLMs are a lot better at memorizing than humans, but the" / X

x.com/karpathy/status/1804208334033371213

Jun 21, 2024

How to use Claude’s artifacts

medium.com/@simeon.emanuilov/how-to-use-claudes-artifacts-908835dbd96a

Jun 21, 2024

Keyon Vafa on X: "New paper: How can you tell if a transformer has the right world model? We trained a transformer to predict directions for NYC taxi rides. The model was good. It could find shortest paths between new points But had it built a map of NYC? We reconstructed its map and found this: https://t.co/5z6sglnRIQ" / X

x.com/keyonV/status/1803838591371555252

Jun 20, 2024

2406.04692v1.pdf

arxiv.org/pdf/2406.04692

Jun 20, 2024

Aqua Voice - Voice-only Document Editor

withaqua.com/?ref=upstract.com

Jun 20, 2024

Rob Wiblin on X: ""The results were otherworldly. Claude is fully capable of acting as a Supreme Court Justice right now. When used as a law clerk, Claude is easily as insightful and accurate as human clerks, while towering over humans in efficiency." https://t.co/tfdYtHSqnT https://t.co/83t85g5Wtp" / X

x.com/robertwiblin/status/1803388400084381787

Jun 20, 2024

NeurIPS-2021-attention-approximates-sparse-distributed-memory-Paper.pdf

proceedings.neurips.cc/paper_files/paper/2021/file/8171ac2c5544a5cb54ac0f38bf477af4-Paper.pdf

Jun 19, 2024

Lienid on X: "@mikeknoop been clear for a while that transformers have nailed associative memory. it even maps to a biologically plausible mechanism. frankly i’m not sure how people haven’t come to this conclusion yet https://t.co/remVwHBlYd" / X

x.com/0xLienid/status/1803530958207066114

Jun 19, 2024

Mike Knoop on X: "If superintelligence is human-level skill acquisition (AGI) plus narrow super-human characteristics, like memorization or inference speed, this is plausibly within reach. The former still requires new 0 to 1 ideas (see ARC Prize) but the latter already exists." / X

x.com/mikeknoop/status/1803528066616246478

Jun 19, 2024

Rohan Paul on X: "Transformer models can learn robust reasoning skills (beyond those of GPT-4-Turbo and Gemini-1.5-Pro) through a stage of training dynamics that continues far beyond the point of overfitting (i.e. with 'Grokking') 🤯 For a challenging reasoning task with a large search space,… https://t.co/Tl9bND5PHq" / X

x.com/rohanpaul_ai/status/1803478727067603055

Jun 19, 2024

Gary Basin 🍍 on X: "Why deep learning is ngmi in one graph https://t.co/lZwvEnXy8H" / X

x.com/garybasin/status/1802465723215737112

Jun 19, 2024

davidad 🎇 on X: "When @GaryMarcus and others (including myself) say that LLMs do not “reason,” we mean something quite specific, but it’s hard to put one’s finger on it, until now. Specifically, Transformers do not generalize algebraic structures out of distribution." / X

x.com/davidad/status/1802576341470216362

Jun 19, 2024

abhav on X: "Something weird is afoot. quick story involving: - open source "reasoning" SOTA LLM (only 7B params, and from china!) - big math doing small math - a $1M opportunity well, almost $1m and that is really tough. anyway, strap in 🍿🧵" / X

x.com/abhav_k/status/1802572167617626399

Jun 19, 2024

Hesam on X: "Reasoning with LLM is Hard! Large Language Models need help with generalized reasoning capabilities, and a key factor is how we prompt them. 📌 Traditional prompting methods such as Chain-of-Thought (CoT) or Tree-of-Thought (ToT) often require multiple assumptions or numerous… https://t.co/XtRVXNSydX" / X

x.com/itsHesamSheikh/status/1801934604334477355

Jun 19, 2024

AlphaMath Almost Zero: Process Supervision without Process

arxiv.org/abs/2405.03553

Jun 19, 2024

Aran Komatsuzaki on X: "@jeremyphoward Btw there are many other recent papers with LLM + MCTS for reasoning with successful results. Here are some interesting ones: - https://t.co/TpE92UMx2C - https://t.co/1Kh8rVyTat" / X

x.com/arankomatsuzaki/status/1803482585378726379

Jun 19, 2024

Alessio Fanelli on X: "How AI is eating Finance 📈 @vagabondjack is back on @latentspacepod! He shared all the AI Engineering wisdom he acquired while turning LLMs into AI thought partners @brightwaveio for customers with >$120B under management 💰 - Why he lost faith in long context windows - 3 https://t.co/AKJ82amHDC" / X

x.com/FanaHOVA/status/1800553625607155856

Jun 19, 2024

Xing Han Lu on X: "Announcing ⚡BM25S, a fast lexical retrieval library! 🏎️ Up to 500x faster than the most popular Python lib, matches @Elastic search (BM25 default) 🤗 First BM25 library that is directly integrated with @huggingface hub: load or save in 1 line! GitHub: https://t.co/iuQleXIGgX https://t.co/trNv0QbUao" / X

x.com/xhluca/status/1803100958408241597

Jun 19, 2024

Naomi Saphra on X: "Modern generative models are trained to imitate human experts, but can they actually beat those experts? Our new paper uses imitative chess agents to explore when a model can "transcend" its training distribution and outperform every human it's trained on. https://t.co/oKsIh5nVBk https://t.co/rA3TzmIXm7" / X

x.com/nsaphra/status/1803114822445465824

Jun 19, 2024

Patterns for Building LLM-based Systems & Products

eugeneyan.com/writing/llm-patterns/

Jun 19, 2024

Arvind Narayanan on X: "Tired: train/test leakage. Wired: benchmark contamination. Inspired: resample until answer is correct." / X

x.com/random_walker/status/1803392358093857127

Jun 19, 2024

Alex Cheema - e/acc on X: "Llama 3 running locally on iPhone with MLX Built by @exolabs_ team @mo_baioumy h/t @awnihannun MLX & @Prince_Canuma for the port https://t.co/4swkM7mOfI" / X

x.com/ac_crypto/status/1781061013716037741

Jun 19, 2024

2406.11741v1.pdf

arxiv.org/pdf/2406.11741

Jun 19, 2024

Context caching | Google AI for Developers | Google for Developers

ai.google.dev/gemini-api/docs/caching?lang=python

Jun 19, 2024

x.com/johnathanbi/status/1803096216299090267?s=12

Jun 19, 2024

Pass@k or Pass@1? · Issue #1 · trotsky1997/MathBlackBox

github.com/trotsky1997/MathBlackBox/issues/1

Jun 18, 2024

quickwit-oss/tantivy: Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust

github.com/quickwit-oss/tantivy

Jun 18, 2024

Olympiad Solutions - Search / X

x.com/search?q=Olympiad%20Solutions&src=typed_query

Jun 18, 2024

The 100 Rep Squat Challenge

kettlebellaerobics.substack.com/p/the-100-rep-squat-challenge

Jun 18, 2024

Applied LLMs - What We’ve Learned From A Year of Building with LLMs

applied-llms.org/

Jun 18, 2024

Terry Yue Zhuo on X: "In the past few months, we’ve seen SOTA LLMs saturating basic coding benchmarks with short and simplified coding tasks. It's time to enter the next stage of coding challenge under comprehensive and realistic scenarios! -- Here comes BigCodeBench, benchmarking LLMs on solving… https://t.co/w3Z6N5wnVk" / X

x.com/terryyuezhuo/status/1803076834520945117

Jun 18, 2024

François Chollet on X: "I believe that program synthesis will solve reasoning. And I believe that deep learning will solve program synthesis (by guiding a discrete program search process). But I don't think you can go all that far with just prompting a LLM to generate end-to-end Python programs (even…" / X

x.com/fchollet/status/1803096195684012371

Jun 18, 2024

Caiming Xiong on X: "🎆I am pleased to announce the release of the latest version of the Salesforce Embedding Model (SFR-embedding-v2), which has reclaimed the top-1 position on the MTEB benchmark. ✨ Key Highlights: 🥇 Achieved the distinction of being the second model to surpass a 70+ performance… https://t.co/ucs4gXfp1v" / X

x.com/CaimingXiong/status/1802879572385714496

Jun 18, 2024

Debunking the Chessboard: Confronting GPTs Against Chess Engines to Estimate Elo Ratings and Assess Legal Move Abilities

blog.mathieuacher.com/GPTsChessEloRatingLegalMoves/

Jun 18, 2024

Beyond the Basics of Retrieval for Augmenting Generation – Parlance

parlance-labs.com/education/rag/ben.html

Jun 18, 2024

TaskMeAnything

www.task-me-anything.org/

Jun 18, 2024

John David Pressman on X: "My problem with "transformers don't generalize algebraic structures and therefore don't reason" is that while I agree this is a real limitation there are important aspects of reason which these models in fact do and other methods don't. We may need to divide "reason" up." / X

x.com/jd_pressman/status/1802835378451185733

Jun 18, 2024