Mark Erdmann's Highlights on 'Marzena Karpinska on X: "Can #LLMs truly reason over loooong context? 🤔 NoCha asks LLMs to verify claims about *NEW* fictional books 🪄 📚 ⛔ LLMs that solve needle-in-the-haystack (~100%) struggle on NoCha! ⛔ None of 11 tested LLMs reach human performance → 97%. The best, #GPT-4o, gets only 55.8%. https://t.co/beuo7q9KIj" / X'

Ethan Mollick on X: "An experiment shows temperature has different effects on test-taking for men vs. women. For verbal tests, women beat men when it is over 70° F (maxing at 90°). In math, men & women do the same when its 80°. They suggest setting office thermostats higher! https://t.co/JJnSXs4VV4 https://t.co/LcxbIpdcqR" / X

x.com/emollick/status/1360763996484366336

Jul 11, 2024

Morgan McGuire (Hiring 👋) on X: "RIP RAG “I think long context is definitely the future rather than RAG” On domain specialisation: “If you want a model for medical domain, legal domain…it (finetuning) definitely makes sense…finetuning can also be an alternative to RAG” Great episode, had to listen 0.75x 😂" / X

x.com/morgymcg/status/1810973158331072630

Jul 10, 2024

Ethan Mollick on X: "We know good management is causal in part because, a decade ago, teams of consultants introduced basic management practices to some Indian plants & left others as a control. The practices boosted performance then. A followup shows about half the effects persist 10 years later! https://t.co/x4NROBczIJ" / X

x.com/emollick/status/1808922231302422840

Jul 4, 2024

Lingming Zhang on X: "Introducing OpenAutoCoder-Agentless😺: A simple agentless solution solves 27.3% GitHub issues on SWE-bench Lite with ~$0.34 each, outperforming all open-source AI SW agents! It's fully open-source, try it out: 🧑‍💻https://t.co/AKyiZhmi7B 📝https://t.co/Oc4QCaQult https://t.co/gQDfCrLzs3" / X

x.com/LingmingZhang/status/1808501612056629569

Jul 4, 2024

Rohan Paul on X: "Quite an wild idea in this paper - Proposes a persona-driven data synthesis methodology using Persona Hub, a collection of 1 billion diverse personas, to create scalable and diverse synthetic data for LLM training and evaluation. 📌 Persona Hub contains 1bn+ personas derived https://t.co/r3pjDYa49u" / X

x.com/rohanpaul_ai/status/1808096574997770590

Jul 3, 2024

Pat Walls on X: "Free business idea for anyone that can code: Build a tiny saas around a SINGLE Zapier integration. Hear me out... So Zapier has 6,000 integrations. 6,000! 1. Find an integration that's (1) popular and (2) limited in functionality 2. Make it 10x better, cover edge cases, etc. https://t.co/TppPkhRDNf" / X

x.com/thepatwalls/status/1808150786804707755

Jul 3, 2024

AGI will drastically increase economies of scale — LessWrong

www.lesswrong.com/posts/Sn5NiiD5WBi4dLzaB/agi-will-drastically-increase-economies-of-scale

Jul 3, 2024

Aidan McLau on X: "livebench (https://t.co/3fKC4vaoTE) is my new favorite eval: > contamination proof (new questions monthly) >tests model iq (unlike arena nowadays) >matches my intuition on relative perf quite well thanks @jpohhhh for the pointer https://t.co/fDXfG51wJe" / X

x.com/aidan_mclau/status/1807875944088326271

Jul 2, 2024

Rohan Paul on X: "Brilliant new paper, HUGE for LLM's internalized knowledge 🔥 Out Of Context Learning > In Context Learning | Fine-tuning can teach new concepts better than ICL 📌 Finds a surprising capability of LLMs through a process called inductive out-of-context reasoning (OOCR). In the https://t.co/Ys5LUgLNKp" / X

x.com/rohanpaul_ai/status/1807774433550950816

Jul 1, 2024

elvis on X: "This is one of the coolest ideas for scaling synthetic data that I've come across. Proposes 1 billion diverse personas to facilitate the creation of diverse synthetic data for different scenarios. It's easy to generate synthetic data but hard to scale up its diversity which is https://t.co/UR998d49hE" / X

x.com/omarsar0/status/1807827401122238628

Jul 1, 2024

Jeff Morris Jr. on X: "“How to ship fast as a small company looking for product-market fit” — by @varunsrin @farcaster_xyz is one of the fastest engineering teams I’ve ever seen… Here is how they operate: https://t.co/AARMozy2zy" / X

x.com/jmj/status/1806125131024183452

Jun 28, 2024

Steve Stewart-Williams on X: "The Big 5 personality traits strongly predict life satisfaction (r = .8 - one of the largest effects I’ve seen in a psychology paper). https://t.co/K2OaLCSD0L https://t.co/TlZiLN3ibe" / X

x.com/SteveStuWill/status/1806087660139946432

Jun 28, 2024

Greg Kamradt on X: "How do SOTA LLMs do on ARC Prize? We wanted to see how gpt-4o, claude sonnet, and gemini did on public tasks So we made a baseline template with @LangChainAI that tests them all Scores: * Claude Sonnet: 21% * gpt-4o: 9% * gemini 1.5: 8% https://t.co/6wXW8E3vOE" / X

x.com/GregKamradt/status/1806373849333653975

Jun 28, 2024

Ethan Mollick on X: "Two big lessons in the new OpenAI paper on training AI to detect AI bugs, 1) Cyborgs rule: AI detected more bugs than humans alone, but humans & AI working together had lower hallucination rates… 2)…for now: human error rates were also high. And read the highlighted conclusion. https://t.co/FbPTQVNPeI" / X

x.com/emollick/status/1806401500194672742

Jun 28, 2024

BioBootloader on X: "1/ Thrilled to announce that our team has created the most advanced coding AI in the world, smashing the previous State-of-the-Art by solving 38.33% of SWE-bench Lite! MentatBot is not only the most accurate, but runs extremely quickly and is available for you to use today! https://t.co/FNJ7nPKaNi" / X

x.com/bio_bootloader/status/1806342922893394290

Jun 27, 2024

Finding GPT-4’s mistakes with GPT-4

openai.com/index/finding-gpt4s-mistakes-with-gpt-4/

Jun 27, 2024

Greg Kamradt on X: "Last week @RyanPGreenblatt shared his gpt-4o based attempt on ARC-AGI We verified his score, excited to say his method got 42% on public tasks We’re publishing a secondary leaderboard to measure attempts like these So of course we tested gpt-4, claude sonnet, and gemini https://t.co/lyfIKNOioL" / X

x.com/GregKamradt/status/1806372523170533457

Jun 27, 2024

MultiOn on X: "Introducing Retrieve API: the best-in-class autonomous web information retrieval API. Developers love our Agent API ❤️. Since its launch, we have consistently received feedback that many use cases rely on intelligently leveraging the Agent API to retrieve information from the https://t.co/upOn8TflUj" / X

x.com/MultiOn_AI/status/1806007797030834521

Jun 27, 2024

University to Replace Students With ChatGPT After It Outperforms Them in Exams / X

x.com/i/trending/1806333778765271252

Jun 27, 2024

Ethan Mollick on X: "This isn't reliable enough yet, but it is a sign of what is coming: Claude 3.5 here's excel of my startup's finances, make a dashboard Add sensitivity analysis of key assumptions Run it as a Monte Carlo simulation Assuming a normal distribution, what are outcomes? All first try https://t.co/JBnudGihja" / X

x.com/emollick/status/1806321738734600337

Jun 27, 2024

Marzena Karpinska on X: "Can #LLMs truly reason over loooong context? 🤔 NoCha asks LLMs to verify claims about NEW fictional books 🪄 📚 ⛔ LLMs that solve needle-in-the-haystack (~100%) struggle on NoCha! ⛔ None of 11 tested LLMs reach human performance → 97%. The best, #GPT-4o, gets only 55.8%. https://t.co/beuo7q9KIj" / X

x.com/mar_kar_/status/1805660949023793224

Jun 26, 2024

Gizem Akdag on X: "Here is a Midjourney Style Reference that I think you'll like: --sref 3721090848. Save this code for calm, soft vibes. This one works really well with architecture, city, and interior design shots. Also, with a 35 mm film look, it gives a vintage feel. This time, I used Krea https://t.co/hKQhkCx6MI" / X

x.com/gizakdag/status/1785257036151656918

Jun 26, 2024

Tuhin Chakrabarty on X: "New paper with students @BarnardCollege on testing orthogonal thinking / abstract reasoning capabilities of Large Language Models using the fascinating yet frustratingly difficult @nytimes Connections game. #NLProc #LLMs #GPT4o #Claude3opus 🧵(1/n) https://t.co/jDfCbpPi2Z" / X

x.com/TuhinChakr/status/1805999559585227002

Jun 26, 2024

François Fleuret on X: "This is an argument often used (by me included), but I find it slightly unsatisfactory. Consider natural selection as a process that given tons of training data produce a 100k x 100k configuration of the game of life (that's roughly the information in human dna). 1/3" / X

x.com/francoisfleuret/status/1806037891757293626

Jun 26, 2024

clem 🤗 on X: "Pumped to announce the brand new open LLM leaderboard. We burned 300 H100 to re-run new evaluations like MMLU-pro for all major open LLMs! Some learning: - Qwen 72B is the king and Chinese open models are dominating overall - Previous evaluations have become too easy for recent" / X

x.com/ClementDelangue/status/1805989925080219927

Jun 26, 2024

Anna Mills, annamillsoer.bsky.social, she/her on X: "Can we tell when a student submission is AI? This study from the University of Reading suggests not. "The university’s markers – who were not told about the project – flagged only one of the 33 entries." https://t.co/dTuSCTJnHF" / X

x.com/AnnaRMills/status/1806033830027182423

Jun 26, 2024

Ethan Mollick on X: "Researchers secretly added AI-created the papers to the exam pool: “We found that 94% of our AI submissions were undetected. The grades awarded to our AI submissions were on average half a grade boundary higher than that achieved by real students.“ https://t.co/z8IX14133B. https://t.co/JDmET3q7pw" / X

x.com/emollick/status/1806040241104470228

Jun 26, 2024

https://arxiv.org/abs/2406.13121 - Search / X

x.com/search?q=https%3A%2F%2Farxiv.org%2Fabs%2F2406.13121&src=typed_query&f=top

Jun 26, 2024

Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?

arxiv.org/abs/2406.13121

Jun 26, 2024

Posts liked by mark erdmann (@markerdmann) / X

x.com/markerdmann/likes

Jun 25, 2024

Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs

arxiv.org/abs/2406.11695

Jun 25, 2024

Dan Hendrycks on X: "Nat's right so I think I'm going to make 2-3 more benchmarks to replace MMLU and MATH." / X

x.com/DanHendrycks/status/1804929811703591345

Jun 24, 2024

Eugene Yan (SF 22 - 28 June) on X: "i previously spoke to a team who only used embedding-based retrieval. i suggested, insisted, they try lexical search. at our next chat, they shared that 80% of the relevant docs now come from lexical search. i.e., without lexical search they were missing 80% of the juice for RAG. https://t.co/2N92Xygw1G" / X

x.com/eugeneyan/status/1804270554033328359

Jun 21, 2024

Detecting hallucinations in large language models using semantic entropy - Nature

www.nature.com/articles/s41586-024-07421-0

Jun 21, 2024

Andrej Karpathy on X: "The way to think about asking a factual question to an LLM is that it's a bit like asking a person who read about the topic previously, but they are not allowed to reference any material and have to answer just from memory. LLMs are a lot better at memorizing than humans, but the" / X

x.com/karpathy/status/1804208334033371213

Jun 21, 2024

How to use Claude’s artifacts

medium.com/@simeon.emanuilov/how-to-use-claudes-artifacts-908835dbd96a

Jun 21, 2024

Keyon Vafa on X: "New paper: How can you tell if a transformer has the right world model? We trained a transformer to predict directions for NYC taxi rides. The model was good. It could find shortest paths between new points But had it built a map of NYC? We reconstructed its map and found this: https://t.co/5z6sglnRIQ" / X

x.com/keyonV/status/1803838591371555252

Jun 20, 2024

2406.04692v1.pdf

arxiv.org/pdf/2406.04692

Jun 20, 2024

Aqua Voice - Voice-only Document Editor

withaqua.com/?ref=upstract.com

Jun 20, 2024

Rob Wiblin on X: ""The results were otherworldly. Claude is fully capable of acting as a Supreme Court Justice right now. When used as a law clerk, Claude is easily as insightful and accurate as human clerks, while towering over humans in efficiency." https://t.co/tfdYtHSqnT https://t.co/83t85g5Wtp" / X

x.com/robertwiblin/status/1803388400084381787

Jun 20, 2024

NeurIPS-2021-attention-approximates-sparse-distributed-memory-Paper.pdf

proceedings.neurips.cc/paper_files/paper/2021/file/8171ac2c5544a5cb54ac0f38bf477af4-Paper.pdf

Jun 19, 2024

Lienid on X: "@mikeknoop been clear for a while that transformers have nailed associative memory. it even maps to a biologically plausible mechanism. frankly i’m not sure how people haven’t come to this conclusion yet https://t.co/remVwHBlYd" / X

x.com/0xLienid/status/1803530958207066114

Jun 19, 2024

Mike Knoop on X: "If superintelligence is human-level skill acquisition (AGI) plus narrow super-human characteristics, like memorization or inference speed, this is plausibly within reach. The former still requires new 0 to 1 ideas (see ARC Prize) but the latter already exists." / X

x.com/mikeknoop/status/1803528066616246478

Jun 19, 2024

Rohan Paul on X: "Transformer models can learn robust reasoning skills (beyond those of GPT-4-Turbo and Gemini-1.5-Pro) through a stage of training dynamics that continues far beyond the point of overfitting (i.e. with 'Grokking') 🤯 For a challenging reasoning task with a large search space,… https://t.co/Tl9bND5PHq" / X

x.com/rohanpaul_ai/status/1803478727067603055

Jun 19, 2024

Gary Basin 🍍 on X: "Why deep learning is ngmi in one graph https://t.co/lZwvEnXy8H" / X

x.com/garybasin/status/1802465723215737112

Jun 19, 2024

davidad 🎇 on X: "When @GaryMarcus and others (including myself) say that LLMs do not “reason,” we mean something quite specific, but it’s hard to put one’s finger on it, until now. Specifically, Transformers do not generalize algebraic structures out of distribution." / X

x.com/davidad/status/1802576341470216362

Jun 19, 2024

abhav on X: "Something weird is afoot. quick story involving: - open source "reasoning" SOTA LLM (only 7B params, and from china!) - big math doing small math - a $1M opportunity well, almost $1m and that is really tough. anyway, strap in 🍿🧵" / X

x.com/abhav_k/status/1802572167617626399

Jun 19, 2024

Hesam on X: "Reasoning with LLM is Hard! Large Language Models need help with generalized reasoning capabilities, and a key factor is how we prompt them. 📌 Traditional prompting methods such as Chain-of-Thought (CoT) or Tree-of-Thought (ToT) often require multiple assumptions or numerous… https://t.co/XtRVXNSydX" / X

x.com/itsHesamSheikh/status/1801934604334477355

Jun 19, 2024

AlphaMath Almost Zero: process Supervision without process

arxiv.org/abs/2405.03553

Jun 19, 2024

Aran Komatsuzaki on X: "@jeremyphoward Btw there are many other recent papers with LLM + MCTS for reasoning with successful results. Here are some interesting ones: - https://t.co/TpE92UMx2C - https://t.co/1Kh8rVyTat" / X

x.com/arankomatsuzaki/status/1803482585378726379

Jun 19, 2024
