Starting Your CI/CD Journey

1. Start Small, Think Big
- Don't try to overhaul your entire codebase at once
- Begin with a small project as your pilot
- Gradually expand your CI/CD pipeline as you gain experience and confidence

2. Get Team Buy-In
- CI/CD is a significant shift in workflow: make sure your team is on board
- Educate your team on the benefits of CI/CD: faster time to market, improved code quality, and fewer manual errors
- Address concerns and foster a culture of continuous improvement

3. Embrace Automation
- The heart of CI/CD is automation: the more, the better
- Look for opportunities to automate manual tasks across your development lifecycle

Key Automation Milestones
Strive to reach these automation checkpoints on your CI/CD journey:

1. Unit Test Execution Automation - run all unit tests automatically on every code change
2. Build Automation - automate your build process to produce consistent, reproducible builds
3. Code Coverage Check Automation - automatically measure and report code coverage for each build
4. Code Quality Check Automation - run automated code quality checks to maintain high standards
5. Security Scanning Automation - integrate automated security scans to catch vulnerabilities early
6. Automated Deployments with Gating - set up automated deployments with quality gates so only validated code reaches production (a minimal gate sketch follows after this post)
7. Feedback Automation to Production Teams - establish automated feedback loops that keep production teams informed
8. Binary Storage Automation into a Repository Manager - automatically store build artifacts in a repository manager
9. Infrastructure Setup Automation - use Infrastructure as Code (IaC) to automate environment setup

Pro Tips for CI/CD Success
- Continuous Learning: stay up to date with the latest CI/CD tools and best practices
- Metrics Matter: track key performance indicators (KPIs) to measure the impact of your CI/CD implementation
- Iterate and Improve: regularly review and refine your pipeline based on team feedback and changing project needs

How has implementing CI/CD transformed your development process? What challenges did you face, and how did you overcome them?
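To make milestone 6 concrete, here is a minimal sketch of a deployment quality gate. The report file names, JSON keys, and thresholds below are assumptions for illustration; a real gate would read whatever your coverage and code-quality tools actually emit and run as a step in your CI pipeline of choice.

```python
# Minimal sketch of a deployment quality gate (milestone 6). The report file
# names, JSON keys, and thresholds are assumptions; adapt them to whatever
# your coverage and code-quality tools actually produce.
import json
import sys

MIN_COVERAGE = 80.0        # assumed team-defined coverage threshold (%)
MAX_CRITICAL_ISSUES = 0    # assumed limit on critical quality findings

def gate(coverage_report: str, quality_report: str) -> bool:
    with open(coverage_report) as f:
        coverage = json.load(f)["total_percent"]     # hypothetical key
    with open(quality_report) as f:
        critical = json.load(f)["critical_issues"]   # hypothetical key
    ok = coverage >= MIN_COVERAGE and critical <= MAX_CRITICAL_ISSUES
    print(f"coverage={coverage:.1f}%, critical issues={critical} -> "
          f"{'PASS' if ok else 'FAIL'}")
    return ok

if __name__ == "__main__":
    # A non-zero exit code stops the pipeline before the deployment stage.
    sys.exit(0 if gate("coverage.json", "quality.json") else 1)
```

Wired in as the last step before deployment, a small script like this keeps the gate logic versioned alongside the code it protects.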
Automating Repetitive Work Tasks
Explore top LinkedIn content from expert professionals.
-
Few Lessons from Deploying and Using LLMs in Production

Deploying LLMs can feel like hiring a hyperactive genius intern: they dazzle users while potentially draining your API budget. Here are some insights I've gathered:

1. "Cheap" is a Lie You Tell Yourself: Cloud costs per call may seem low, but the overall expense of an LLM-based system can skyrocket. Fixes:
- Cache repetitive queries: users ask the same thing at least 100x/day (see the caching sketch after this post)
- Gatekeep: use cheap classifiers (e.g. BERT) to filter "easy" requests. Let LLMs handle only the complex 10% and your current systems handle the remaining 90%.
- Quantize your models: shrink LLMs to run on cheaper hardware without large accuracy drops
- Asynchronously build your caches: pre-generate common responses before they're requested, or fail gracefully the first time a query arrives and cache the answer for next time

2. Guard Against Model Hallucinations: Sometimes models express answers with such confidence that distinguishing fact from fiction becomes challenging, even for human reviewers. Fixes:
- Use RAG: a fancy way of saying you provide the model the knowledge it needs in the prompt itself, by querying a database for semantic matches with the query
- Guardrails: validate outputs using regex or cross-encoders to establish a clear decision boundary between the query and the LLM's response

3. The best LLM is often a discriminative model: You don't always need a full LLM. Consider knowledge distillation: use a large LLM to label your data, then train a smaller discriminative model that performs similarly at a much lower cost.

4. It's not about the model, it's about the data it was trained on: A smaller LLM might struggle with specialized domain data; that's normal. Fine-tune your model on your specific dataset, starting with parameter-efficient methods (like LoRA or Adapters) and using synthetic data generation to bootstrap training.

5. Prompts are the new Features: Version them, run A/B tests, and continuously refine them with online experiments. Consider bandit algorithms to automatically promote the best-performing variants.

What do you think? Have I missed anything? I'd love to hear your "I survived LLM prod" stories in the comments!
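To make the caching and gatekeeping fixes concrete, here is a minimal sketch assuming an in-memory cache and a placeholder is_easy() classifier; in production you would swap in a shared cache (e.g. Redis) and a real lightweight model such as a fine-tuned BERT head, and llm_call() would hit your actual provider.

```python
# Sketch of "cache repetitive queries" and "gatekeep with a cheap classifier".
# The classifier, cheap_system, and llm_call are placeholders for your stack.
import hashlib

CACHE: dict[str, str] = {}          # swap for Redis/memcached in production

def _key(query: str) -> str:
    # Normalize so trivially repeated questions hit the same cache entry.
    return hashlib.sha256(query.strip().lower().encode()).hexdigest()

def is_easy(query: str) -> bool:
    # Placeholder for a cheap classifier (e.g. a fine-tuned BERT head).
    return len(query.split()) < 6

def cheap_system(query: str) -> str:
    return f"[rules/retrieval answer for: {query}]"

def llm_call(query: str) -> str:
    return f"[expensive LLM answer for: {query}]"

def answer(query: str) -> str:
    k = _key(query)
    if k in CACHE:                   # repeated question -> no model call at all
        return CACHE[k]
    result = cheap_system(query) if is_easy(query) else llm_call(query)
    CACHE[k] = result                # build the cache as queries arrive
    return result

print(answer("What are your opening hours?"))
print(answer("what are your opening hours?"))   # served from cache
```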
-
Now in preview: Amazon Q Code Transformation. Automate the process of code migrations, save time, and focus on innovation. Let's dive in!

Upgrading applications is a common, annoying, time-consuming, complex process. Teams need to understand new features, update dependencies, and maintain compatibility, and at the end of the day - at best - you end up back where you started in terms of capability. It detracts from new feature development, experimentation, and the fun of building.

⬢ Enter Amazon Q Code Transformation, a new capability that automates the upgrade process, from analyzing the code to applying the necessary updates and fixes. It supports upgrades from Java 8 and 11 to Java 17 (with more versions and languages to come).

🛠️ Code Transformation in Q combines program analysis and LLMs to understand and apply upgrades. It operates in a secure environment, ensuring data privacy and control over the upgrade process.

🚀 Efficiency Gains: The automated process can complete most application upgrades in minutes to hours, a significant improvement over the days to weeks it could take manually. This lets developers focus on new feature development rather than maintenance.

⚡️ Future Enhancements: While currently focused on open-source packages, future versions will support internal package upgrades and other languages and platforms, including migrating Windows .NET applications to .NET Core on Linux.

🌟 Impact: Internally, the tool has been used to migrate over a thousand Java applications in just a few days, saving an estimated ten years of software development effort compared to manual upgrades.

General chatbots are fun (check for human reviews before you use them for anything important!), but for my money, channeling AI into specific task automation is the real sweet spot for applying generative AI to real problems today. Q makes this as easy as typing '/transform' - fire it up!

Linked below is a great presentation from re:Invent by Vishvesh Sahasrabudhe and Jas Chhabra introducing Q and Code Transformation.
-
AI is changing the way we code, but reproducing algorithms from research papers or building full applications still takes months. DeepCode, an open-source multi-agent coding platform from the HKU Data Intelligence Lab, is redefining software development with automation, orchestration, and intelligence.

What is DeepCode?
DeepCode is an AI-powered agentic coding system designed to automate code generation, accelerate research-to-production workflows, and streamline full-stack development. With 6.3K GitHub stars, it's one of the most promising open coding initiatives today.

🔹Key Features
- Paper2Code: converts research papers into production-ready code
- Text2Web: transforms plain text into functional, appealing front-end interfaces
- Text2Backend: generates scalable, efficient back-end systems from text prompts
- Multi-Agent Workflow: orchestrates specialized agents to handle parsing, planning, indexing, and code generation

🔹Why It Matters
Traditional development slows down under repetitive coding, research bottlenecks, and implementation complexity. DeepCode removes these inefficiencies, letting developers, researchers, and product teams focus on innovation rather than boilerplate implementation.

🔹Technical Edge
- Research-to-Production Pipeline: extracts algorithms from papers and builds optimized implementations
- Natural Language Code Synthesis: context-aware, multi-language code generation
- Automated Prototyping: generates full app structures including databases, APIs, and frontends
- Quality Assurance Automation: integrated testing, static analysis, and documentation
- CodeRAG System: retrieval-augmented generation with dependency-graph analysis for smarter code suggestions

🔹Multi-Agent Architecture
DeepCode employs agents for orchestration, document parsing, code planning, repository mining, indexing, and code generation, all coordinated for seamless delivery.

🔹Getting Started
1. Install DeepCode: pip install deepcode-hku
2. Configure APIs for OpenAI, Claude, or search integrations.
3. Launch via the web UI or CLI.
4. Input requirements or research papers and receive complete, testable codebases.

With DeepCode, the gap between research, requirements, and production-ready code is closing faster than ever. #DeepCode
-
Are you choosing the best LLM for building your AI Agent? You may have seen plenty of benchmarks that measure performance on math problems, exam papers, and language reasoning, but what about building AI Agents for practical use cases? Very few test real agents doing real work.

I found a great AI Agent Leaderboard, developed by Galileo, that fills that gap. It is the closest thing we have to measuring real-world agent performance.

Why does this matter ⁉️
Most AI agents are already being tasked with booking appointments, processing documents, and making decisions in workflows. But most current benchmarks don't measure whether agents can actually do this well: they focus on static academic tasks like MMLU or GSM8K, not on what happens in production environments.

The Galileo Agent Leaderboard measures what truly matters when you deploy agents:
→ Tool Selection Quality (TSQ): can the agent choose the right tool and parameters?
→ Action Completion (AC): can the agent actually finish a multi-step task correctly, across domains like banking, healthcare, telecom, and insurance?
(Toy versions of both metrics are sketched after this post.)

It's one of the first benchmarks that combines accuracy, safety, and cost-effectiveness for agents operating in real-world business workflows.

Why is this important for you ⁉️
If you're building with AI agents, this helps you answer critical questions:
→ Which model handles tool use and decision-making best?
→ How do different models compare at completing full tasks, not just responding with text?
→ What are the trade-offs between model cost, task completion, and reliability?

Galileo has also open-sourced parts of the evaluation stack, making it easier for teams to run their own assessments. My favourite feature: the ability to filter the leaderboard by industry, such as banking, investment, and healthcare.

If you're working on agent systems or leading an organization that wants to deploy agents in production, this is a benchmark worth checking out.

#AI #AgenticAI #Agents #LLM #AIEngineering #AutonomousAgents #EnterpriseAI #GalileoAI #AIinProduction #GalileoPartner
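For intuition only, toy versions of the two metrics might look like the sketch below. These are illustrative definitions over hypothetical logged agent steps, not Galileo's exact formulas or evaluation code.

```python
# Illustrative only (not Galileo's exact formulas): toy versions of the two
# leaderboard metrics, computed over hypothetical logged agent steps.
from dataclasses import dataclass

@dataclass
class Step:
    expected_tool: str
    chosen_tool: str
    params_correct: bool

def tool_selection_quality(steps: list[Step]) -> float:
    """Fraction of steps where the agent picked the right tool with right params."""
    good = sum(s.chosen_tool == s.expected_tool and s.params_correct for s in steps)
    return good / len(steps)

def action_completion(tasks_completed: int, tasks_total: int) -> float:
    """Fraction of end-to-end tasks the agent finished correctly."""
    return tasks_completed / tasks_total

steps = [Step("search_flights", "search_flights", True),
         Step("book_flight", "search_flights", False)]
print(tool_selection_quality(steps))   # 0.5
print(action_completion(7, 10))        # 0.7
```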
-
Developers today don't just write code, they manage intelligent coding agents. That's why I open-sourced SimulateDev, a tool that runs AI-powered coding IDEs like Cursor, Claude Code, and Windsurf to automatically build features, fix bugs, and generate pull requests across any GitHub repo.

Imagine a swarm of specialized AI agents collaborating on your codebase: a Planner decomposing complex tasks, a Coder implementing the solutions, and a Tester validating the output. This is already how engineers at top AI labs like Anthropic and OpenAI work, and SimulateDev brings this collaborative approach to everyone by orchestrating multiple agents like a real engineering team.

Key capabilities:
(1) Multi-agent workflows - coordinate agents with defined roles (Planner, Coder, Tester), each bringing distinct strengths.
(2) Universal compatibility - works with Cursor, Windsurf, and Claude Code (with Codex, Devin, and Factory support on the way).
(3) Automated PR creation - Clone → Analyze → Implement → Test → Create PR, all automated.

I've already tested SimulateDev on eight widely used open-source projects, where it automatically opened pull requests, some of which have already been approved and merged. (PR links in the comments)

What's next? Integrating web-based coding agents like Cognition's Devin and OpenAI's Codex, along with remote execution support (e.g., @Daytona or @E2B), enabling coding agents to run continuously in the background.

Repo (Apache 2.0): https://lnkd.in/exn2evFs
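A hypothetical sketch of the Planner -> Coder -> Tester hand-off described above; this is not SimulateDev's actual API, just the shape of the orchestration, with every function a stand-in for a real agent call.

```python
# Hypothetical orchestration sketch; not SimulateDev's real interface.
# Each function stands in for an agent (Planner, Coder, Tester) call.
def planner(task: str) -> list[str]:
    return [f"step 1 of '{task}'", f"step 2 of '{task}'"]

def coder(step: str) -> str:
    return f"# patch implementing {step}\n"

def tester(patch: str) -> bool:
    return "patch" in patch            # placeholder for running the test suite

def run_workflow(task: str) -> str:
    patches = []
    for step in planner(task):
        patch = coder(step)
        if not tester(patch):          # failed tests -> send back to the Coder
            patch = coder(step + " (retry after test failure)")
        patches.append(patch)
    return "".join(patches)            # a real system would turn this into a PR

print(run_workflow("add pagination to the issues endpoint"))
```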
-
How far are we from having competent AI co-workers that can perform tasks as varied as software development, project management, administration, and data science? In our new paper, we introduce TheAgentCompany, a benchmark for AI agents on consequential real-world tasks.

Why is this benchmark important? Right now it is unclear how effective AI is at accelerating or automating real-world work. We hear statements like:
> AI is overhyped, doesn't reason, and doesn't generalize to new tasks
> AGI will automate all human work in the next few years

This question has implications for:
- Companies: to understand where to incorporate AI into workflows
- Workers: to get a grounded sense of what AI can and cannot do
- Policymakers: to understand the effects of AI on the labor market

How can we begin to answer it? In TheAgentCompany, we created a simulated software company with tasks inspired by real-world work, built baseline agents, and evaluated their ability to solve these tasks. The benchmark is the first of its kind in the versatility, practicality, and realism of its tasks.

TheAgentCompany features four internal web sites:
- GitLab: for storing source code (like GitHub)
- Plane: for task management (like Jira)
- OwnCloud: for storing company docs (like Google Drive)
- RocketChat: for chatting with co-workers (like Slack)

Based on these sites, we created 175 tasks in the domains of:
- Administration
- Data science
- Software development
- Human resources
- Project management
- Finance

We implemented a baseline agent that can browse the web and write/execute code to solve these tasks, built on the open-source OpenHands framework for full reproducibility (https://lnkd.in/g4VhSi9a). Using this agent, we evaluated many LMs: Claude, Gemini, GPT-4o, Nova, Llama, and Qwen, measuring both success and cost.

Results are striking: the most successful agent, with Claude, solved 24% of the diverse real-world tasks it was given. Gemini-2.0-flash is strong at a competitive price point, and the open Llama-3.3-70B model is remarkably competent.

This paints a nuanced picture of the role of current AI agents in task automation:
- Yes, they are powerful and can perform 24% of tasks similar to those in real-world work
- No, they cannot yet solve all tasks or replace any jobs entirely

Further, there are caveats to our evaluation:
- It is all on simulated data
- We focused on concrete, easily evaluable tasks
- We focused only on tasks from one corner of the digital economy

If TheAgentCompany interests you, please:
- Read the paper: https://lnkd.in/gyQE-xZG
- Visit the site to see the leaderboard or run your own eval: https://lnkd.in/gtBcmq87

And huge thanks to Fangzheng (Frank) Xu, Yufan S., and Boxuan Li for leading the project, and to the many, many co-authors for their tireless efforts over many months to make this happen.
-
Search is a crucial part of many modern internet products, as companies strive to improve the relevance of search results to enhance customer experience and increase retention. The first step in making these improvements is to measure relevance accurately. This blog, written by the data science team at Faire, shares their approach to using large language models (LLMs) to measure semantic relevance in search.

- To define semantic relevance, the team uses a tiered approach based on the ESCI framework, which classifies each search result as "Exact," "Substitute," "Complement," or "Irrelevant." This tiered classification allows for flexible relevance labeling that can fit various downstream application needs.

- To measure semantic relevance, the team initially relied on human annotators. However, this method was costly and slow, providing only a general measurement on a monthly cadence. With recent advancements in LLMs, the team transitioned to using these models to assess the relevance between search queries and products automatically. They fine-tuned a leading LLM to align with the human labelers and measured agreement: the higher the agreement, the better the LLM's performance. The LLM could then scale out far more effectively, providing daily evaluations of search's semantic performance. (A toy version of this agreement check is sketched after this post.)

- The team's LLM approach went through multiple iterations, including adopting more advanced models (e.g., LLaMA 3) and more complex techniques (like quantization and horizontal scaling). With these efforts, the solution reached reasonable accuracy with good scalability, and it now serves the team's purpose of measuring semantic performance to guide their improvements.

This case study highlights that successful LLM applications need clear problem definitions, high-quality labeled data, and iterative model improvements, much like any standard machine learning product integration. It also demonstrates the potential of fine-tuned LLMs in the AI era, making it a compelling read!

#machinelearning #datascience #llm #ai #search #relevance

Check out the "Snacks Weekly on Data Science" podcast and subscribe, where I explain the concepts discussed in this and future posts in more detail:
-- Spotify: https://lnkd.in/gKgaMvbh
-- Apple Podcast: https://lnkd.in/gj6aPBBY
-- YouTube: https://lnkd.in/gcwPeBmR

https://lnkd.in/gwaxRs2r
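The core of the approach, checking how well LLM-assigned ESCI labels agree with human labels on a validation set, can be sketched in a few lines. The labels below are made up for illustration; Faire's actual pipeline, prompts, and models are not shown in the post.

```python
# Sketch of the human-vs-LLM agreement check on ESCI relevance tiers.
# Labels here are invented for illustration only.
ESCI = {"Exact", "Substitute", "Complement", "Irrelevant"}

def agreement_rate(human: list[str], llm: list[str]) -> float:
    """Fraction of query-product pairs where the LLM matches the human label."""
    assert len(human) == len(llm) and all(l in ESCI for l in human + llm)
    matches = sum(h == m for h, m in zip(human, llm))
    return matches / len(human)

human_labels = ["Exact", "Irrelevant", "Substitute", "Exact"]
llm_labels   = ["Exact", "Complement", "Substitute", "Exact"]
print(f"agreement: {agreement_rate(human_labels, llm_labels):.0%}")  # 75%
```

Tracked daily over a held-out human-labeled set, a number like this is what lets the fine-tuned LLM stand in for monthly manual annotation.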
-
Building useful Knowledge Graphs will long be a Humans + AI endeavor. A recent paper lays out how best to implement automation, what the specific human roles are, and how the two are combined.

The paper, "From human experts to machines: An LLM supported approach to ontology and knowledge graph construction", provides clear lessons, including:

🔍 Automate KG construction with targeted human oversight: Use LLMs to automate repetitive tasks like entity extraction and relationship mapping. Human experts should step in at two key points: early, to define scope and competency questions (CQs), and later, to review and fine-tune LLM outputs, focusing on complex areas where LLMs may misinterpret data. Combining automation with human-in-the-loop review ensures accuracy while saving time.

❓ Guide ontology development with well-crafted Competency Questions (CQs): CQs define what the Knowledge Graph (KG) must answer, like "What preprocessing techniques were used?" Experts should create CQs to ensure domain relevance and review LLM-generated CQs for completeness. Once validated, these CQs guide the ontology's structure, reducing errors in later stages.

🧑⚖️ Use LLMs to evaluate outputs, with humans as quality gatekeepers: LLMs can assess KG accuracy by comparing answers to ground-truth data, with humans reviewing outputs that score below a set threshold (e.g., 6/10). This setup lets LLMs handle initial quality control while humans focus only on edge cases, improving efficiency and ensuring quality. (A rough sketch of this triage loop follows after this post.)

🌱 Leverage reusable ontologies and refine with human expertise: Start with pre-built ontologies like PROV-O to structure the KG, then refine with domain-specific details. Humans should guide this refinement, ensuring the KG stays accurate and relevant to the domain's nuances, particularly specialized terms and relationships.

⚙️ Optimize prompt engineering with iterative feedback: Prompts for LLMs should be carefully structured, starting simple and iterating based on feedback. Use in-context examples to reduce variability and improve consistency. Human experts should refine these prompts so they lead to accurate entity and relationship extraction, combining automation with expert oversight for best results.

These lessons provide a solid foundation for optimally applying human and machine capabilities to the very important task of building robust and useful ontologies.
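As a rough sketch of the "LLMs evaluate, humans gatekeep" loop, assuming a placeholder score_with_llm() judge and the 6/10 threshold mentioned above; everything here is illustrative, not the paper's actual implementation.

```python
# Sketch of the LLM-scores / human-gatekeeps triage described in the post.
# score_with_llm is a placeholder for an actual judge prompt and model.
REVIEW_THRESHOLD = 6   # threshold from the post (scores below go to a human)

def score_with_llm(question: str, kg_answer: str, ground_truth: str) -> int:
    # Placeholder: ask a judge LLM to rate agreement on a 0-10 scale.
    return 8 if kg_answer == ground_truth else 4

def triage(items):
    auto_accepted, needs_human = [], []
    for question, answer, truth in items:
        score = score_with_llm(question, answer, truth)
        bucket = auto_accepted if score >= REVIEW_THRESHOLD else needs_human
        bucket.append((question, score))
    return auto_accepted, needs_human

ok, review = triage([
    ("What preprocessing was used?", "min-max scaling", "min-max scaling"),
    ("Which dataset was used?", "MNIST", "CIFAR-10"),
])
print("auto-accepted:", ok)
print("flag for human review:", review)
```

The point of the pattern is simply that human attention goes only to the low-scoring cases, which is where LLM extraction is most likely to have gone wrong.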
-
🤖 Coding just got smarter, faster, and more secure. Meet the 5 AI tools transforming software development in 2025!

1️⃣ GitHub Copilot
Your ultimate coding assistant.
Key Features:
🟣Generates real-time code suggestions
🟣Integrates easily with IDEs like VS Code and JetBrains
🟣Offers custom LLM fine-tuning with personal repositories
Why Use It?
🟣85% of users feel more confident in their code quality
🟣Tasks are completed 15% faster, with a 55% reduction in task time for Copilot users
🟣Trusted by 55% of developers and over 50,000 businesses globally

2️⃣ Cursor IDE
A fork of VS Code with GPT-powered AI enhancements.
Key Features:
🟣Code Generation: predicts and writes code blocks
🟣Smart Rewrites: automatically fixes syntax and formatting
🟣Cursor Prediction: anticipates navigation patterns for efficient coding
🟣Integrated Chatbot: context-aware guidance and suggestions
Why Use It?
Trusted by top organizations like Samsung and OpenAI, Cursor IDE combines advanced AI features with VS Code's flexibility, making it a strong contender in the AI-powered IDE space.

3️⃣ Tabnine
If privacy and data security are a priority, Tabnine is your go-to coding assistant. Built on proprietary and external LLMs, it offers robust code completions.
Key Features:
🟣Privacy-Focused: trained on licensed code, with GDPR and SOC 2 compliance
🟣Transparent Data Use: shares training data under NDA for added trust
🟣Flexibility
Why Use It?
With over 1 million monthly users, Tabnine stands out for prioritizing security without sacrificing productivity.

4️⃣ Warp Terminal
A modern twist on the CLI, Warp combines an IDE-like interface with AI-driven features to simplify terminal tasks.
Key Features:
🟣Warp AI: provides natural-language command suggestions via ChatGPT
🟣Agent Mode: executes commands and resolves errors autonomously
🟣Smart Command Completion: suggests time-saving CLI commands
🟣No-Retention Policy: ensures complete data privacy
Why Use It?
Warp is a game-changer for terminal users, offering features that save time and effort while enhancing productivity.

5️⃣ Replit Agent
Replit Agent goes beyond coding assistance, acting as a virtual junior full-stack developer for building and deploying applications.
Key Features:
🟣Natural Language Interface: build complete applications with simple prompts
🟣Infrastructure Setup: deploy-ready configurations for various applications
🟣Iterative Improvements: add or modify features effortlessly
Why Use It?
Although experimental and available in limited access, Replit Agent offers a glimpse into the future of AI-driven development.

💡 These tools don't just save time, they let developers focus on what truly matters: solving real-world problems and delivering exceptional products.

#AI #SoftwareDevelopment #DeveloperTools #Productivity #TechInnovation