Proposal: AI-Powered Malware Package Scanner for PyPI

Hello, everyone

I would like to propose an AI-powered malware package scanner, because anyone can upload malware to PyPI; even though people report malicious packages, they still cause damage first.

This solution would help keep PyPI safe for developers and minimize malware. For example, soopsocks was a malicious package that was detected, but some users had already downloaded it before it was deleted from PyPI. With this solution, the risk from malicious package uploads would be minimized.

Do you have an implementation of this idea ready for testing?

There are already groups of security researchers who audit the packages on PyPI. The malware reporting feature on PyPI is open to everyone, so anyone who manages to build a malware-scanning AI that isn’t as hopeless as all the general-purpose AI-powered antivirus services that have sprung up is free to download wheels, scan them, and report the bad ones themselves.
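For a sense of what that status-quo workflow involves, here is a minimal, purely illustrative sketch. The pattern list is hypothetical and far cruder than what real researchers use; the only real detail relied on is that wheels are zip archives:

```python
import io
import re
import zipfile

# Hypothetical red-flag patterns; real scanners use far richer signals
# (behavioural sandboxing, install-hook analysis, typosquat detection, ...).
SUSPICIOUS = [rb"base64\.b64decode", rb"\bexec\(", rb"subprocess\.Popen"]

def flag_source(filename: str, src: bytes) -> list[str]:
    """Return which suspicious patterns appear in one source file."""
    return [f"{filename}: {p.decode()}" for p in SUSPICIOUS if re.search(p, src)]

def scan_wheel(wheel_bytes: bytes) -> list[str]:
    """Scan every .py file inside a wheel (a zip archive) for red flags."""
    hits = []
    with zipfile.ZipFile(io.BytesIO(wheel_bytes)) as wheel:
        for name in wheel.namelist():
            if name.endswith(".py"):
                hits.extend(flag_source(name, wheel.read(name)))
    return hits
```

Anything a tool like this flags still needs a human to verify it; a regex hit on `exec(` is obviously not proof of malice.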

Hello, @bwoodsend

No. This feature is designed so that if someone uploads a package, the AI scans it automatically; if malware is found, the package is deleted and a message is sent to the maintainer, to keep the community safe and protect developers before harm or damage is done.

(Somewhat tempted to move this to Help or Packaging, rather than Ideas, which is for ideas about the language.)

This post is of a kind that I don’t very much like, which just lobs a vague thought out without doing much to build towards any kind of productive conversation.

One possibility is that you intend to build this and feel you have a lot of relevant expertise? In which case, cool, let us know when you have a proof of concept! I’m all in favor of building things, even projects which I personally think are probably doomed to fail, since they may spawn or inspire other projects or teach us all what does or does not work.

Another is that you’d like to build this, but are wondering if you should? My advice would be not to. There are already ML-based tools, decades of research, and novel attacks showing up all the time. Classifying packages as malware is incredibly hard. If you’re approaching this as a fun hobby project, I think you’ll end up not enjoying it.

And then third, and this is the possibility that makes me dislike such posts, you might be of the impression that none of the people who work on maintaining these software systems have thought of using software to make their jobs easier? i.e. If the response you were hoping for was “Wow, nobody ever considered that! What an amazing idea! We’ll go build it!” then I think you’re badly out of step with the community. If you are genuinely just curious if an idea, even an obvious one, had been considered or is in use, then you can ask that as a question rather than phrasing it as a novel idea.

8 Likes

These researchers will have the process of downloading and scanning new uploads automated. The only meaningful difference between what you’re proposing and the status quo is you’d have the suspicious packages deleted straight away whereas a human currently has to verify the findings before acting. For that to have a chance of floating, you will need to demonstrate that the false positive rate is so exceptionally low that the proposal is worth the enormous disruption that even one misguided deletion could cause.
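To make the false-positive concern concrete, here is the base-rate arithmetic under purely illustrative numbers (none of these figures are real PyPI statistics):

```python
uploads_per_day = 20_000   # illustrative upload volume, not a real PyPI figure
malware_rate = 0.001       # assume 1 in 1,000 uploads is actually malicious
tpr = 0.99                 # scanner detects 99% of real malware
fpr = 0.001                # scanner wrongly flags 0.1% of benign uploads

malicious = uploads_per_day * malware_rate   # 20 malicious uploads/day
benign = uploads_per_day - malicious         # 19,980 benign uploads/day

true_positives = malicious * tpr             # 19.8 caught per day
false_positives = benign * fpr               # 19.98 benign deletions per day

# Even with a seemingly excellent 0.1% false-positive rate, roughly half
# of all automatic deletions would hit innocent projects:
precision = true_positives / (true_positives + false_positives)
print(round(precision, 2))  # 0.5
```

Because genuine malware is rare relative to benign uploads, even a very accurate classifier spends much of its fire on innocent packages, which is exactly why a human currently verifies findings before anything is deleted.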

Hello, @bwoodsend

First, the AI needs to be trained to minimize false positives and false negatives before integration. This can take years, but that's no problem: once the training is finished, integrate it into pypi.org.

? What?

Yes, AI/ML models need to be trained, and that takes some years; I think everyone here knows that.

But if you yourself realise that it’s going to take a long time to train a model that can accurately detect malware without disturbing harmless projects, then why propose something that will take a lot of time (and money) whilst not actually increasing the usefulness of PyPI that much?

IMO, if you download something from PyPI without checking the code beforehand, and it turns out to be a piece of malware, well, that’s your fault. It’s the same situation that GitHub is facing. If they were to ban malware, that could even be ‘bad’. Sometimes malware is explicitly placed somewhere in order to e.g. inform others about certain kinds of malware (e.g. “This kind of virus works by …”). You’d automatically lock that kind of content too. Humans are the most reliable way to check for malicious code in a project.

I disagree. Supply chain attacks to start, but you should also feel safe trusting a project like numpy by reputation without having the requisite skills to audit the source. Securing the supply chain is a difficult (maybe, in the perfect sense, impossible) task, but it is an active effort.

However, that’s all nuanced stuff, and I’m not sure this thread is the place to discuss it.

Hello, @sirosen

No, this should be implemented because of the threat of cyberattacks in 2025 and beyond. Attackers are leveraging AI, so defenders should also train AI. For example, the AI could monitor popular packages on PyPI; any obfuscated code appearing in a popular package that was previously safe would be a red flag and would trigger an email to the maintainer to take action and remove the malicious package, reducing abuse of PyPI.
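For what an “obfuscated code” red flag might even mean mechanically, one crude, hypothetical heuristic is flagging long string literals with high Shannon entropy, since base64-encoded payloads score much higher than ordinary prose. The length and entropy thresholds below are guesses, not tuned values:

```python
import ast
import math

def shannon_entropy(s: str) -> float:
    """Bits per character of a string; high values suggest encoded/packed data."""
    if not s:
        return 0.0
    freq = {c: s.count(c) / len(s) for c in set(s)}
    return -sum(p * math.log2(p) for p in freq.values())

def flag_high_entropy_literals(source: str,
                               min_len: int = 40,
                               threshold: float = 4.5) -> list[str]:
    """Return string constants in `source` that look like packed payloads."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            s = node.value
            if len(s) >= min_len and shannon_entropy(s) >= threshold:
                hits.append(s[:30] + "...")
    return hits
```

Real obfuscation (minified loaders, staged downloads, homoglyph tricks) goes far beyond what a heuristic like this catches, and plenty of legitimate code embeds high-entropy data, which is part of why classifying packages automatically is so hard.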

It doesn’t really matter what the motivation or specifics of how it’s to be done are. The answer is the same. If you think this can work and want to do it yourself then no-one is stopping you. If you’re expecting someone else to do the hard work on a whim because you suggested they should then forget it.

3 Likes