Generate sitemap for metacpan.org website #924
Conversation
This introduces a script called bin/generate_sitemap.pl that creates XML files containing URLs for authors, releases, and modules; a module that does all of the heavy lifting; a test script that exercises the module; and an updated robots.txt file.
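For reference, the generated files follow the standard sitemaps.org protocol and are written out gzipped (the conversation below mentions releases.xml.gz). Here is a minimal sketch of the general approach; the file name and example URLs are illustrative, not taken from the actual script:

```perl
#!/usr/bin/env perl
# Illustrative sketch only -- not the actual bin/generate_sitemap.pl.
# It shows the general shape of the output: a gzipped XML file per object
# type, following the sitemaps.org protocol.
use strict;
use warnings;
use IO::Compress::Gzip qw( gzip $GzipError );

# Hypothetical list of author page URLs (the PAUSE IDs are just examples).
my @urls = map {"https://metacpan.org/author/$_"} qw( OALDERS TALEXB );

my $xml = qq{<?xml version="1.0" encoding="UTF-8"?>\n}
        . qq{<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n};
$xml .= "  <url><loc>$_</loc></url>\n" for @urls;
$xml .= "</urlset>\n";

gzip \$xml => 'authors.xml.gz'
    or die "gzip failed: $GzipError";
```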
bin/generate_sitemap.pl Outdated
Please use FindBin, so we can call this script from any cwd
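Something along these lines at the top of the script would cover it (a sketch, assuming the script lives in bin/ and the module it loads lives under lib/):

```perl
use strict;
use warnings;
use FindBin;
# Resolve the lib/ path relative to the script's own location instead of
# the current working directory, so it can be invoked from anywhere.
use lib "$FindBin::Bin/../lib";
```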
Done.
Great work! That gives us a good starting point. We need to think about what we want to include in the sitemap, though. Right now, the script includes all modules, i.e. unauthorized, backpan, and non-latest ones. The same goes for releases.
I'm happy to filter the search so that the sitemap is better; let me know how the search can be adjusted and I'll change it.
Previous version dumped all releases; releases are now limited to just the latest ones, bringing that file in line with the size of the others.
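The exact query lives in the module itself; the rough idea is just to filter on release status instead of pulling everything. A sketch of the kind of filter involved, where the field names and query structure are assumptions rather than copied from the code:

```perl
# Hypothetical ElasticSearch-style query body: only releases marked as the
# latest version of their distribution make it into the sitemap. A similar
# term filter could exclude unauthorized releases as well.
my $query = {
    query  => { match_all => {} },
    filter => { term => { status => 'latest' } },
    fields => [qw( author distribution )],
};
```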
Hi, I might be missing something, but why do we have an objectType of distribution and release? The code seems to generate a releases.xml.gz with the download_url. Is that intentional? I haven't actually run the script, so all of the above is what I understood from the source code.
@monken I wasn't sure exactly what I should be picking out from the data structure, so I went with what made sense to me. I'll be glad to change it if I have it wrong.
@talexb First off, thanks very much for all of the work you put into this. I apologize that I did not look at this sooner. Can you have a look at the changes I made to your branch here? https://github.com/CPAN-API/metacpan-web/tree/oalders/talexb/addSitemap

There's still one issue to fix -- the module links. Currently they look like https://metacpan.org/module/HTML-Restrict but they should be https://metacpan.org/module/HTML::Restrict. Having said that, you're creating the module links from distribution names, but they really should be created from module names (see @monken's comment above). Here's an example of how I scroll through modules (and Pod): https://github.com/oalders/iCPAN/blob/master/perl/lib/iCPAN.pm#L522 -- @monken can probably give you a better example, though.

One other issue is that the test should run against our test ES rather than on a subset of results from the production machine.

I removed a fair number of tests, as they've mostly been made redundant via Moose and MooseX::StrictConstructor. If you have any questions about the conversion to Moose (or any other changes), feel free to hit me up. Commenting on the commits is probably a good way to manage that conversation.

As far as the download links are concerned, the issue is that the sitemap is for Google to easily find the pages which we want indexed. Download links won't help with SEO, so while they're helpful to have, they're not fixing the problem at hand, which is that Google doesn't like us very much. :)

Thanks again! As soon as we get the modules and the test sorted, I think we can merge this.
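To illustrate the distinction about module links (the variable names and values here are hypothetical):

```perl
# Distribution names use hyphens, while module names use '::' separators.
# Sitemap entries for module pages should be built from module names.
my $distribution = 'HTML-Restrict';     # distribution name
my $module_name  = 'HTML::Restrict';    # module name
my $release_url  = "https://metacpan.org/release/$distribution";
my $module_url   = "https://metacpan.org/module/$module_name";
```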
@oalders Could you please look at this again now that the big Bootstrap work is out of the way? It might help with the speed of the site if we can feed Google these lists instead of having it crawl the site.
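For reference, pointing crawlers at the generated files only takes one Sitemap line per file in robots.txt; the file names and locations below are illustrative, not the ones in the PR:

```
Sitemap: https://metacpan.org/authors.xml.gz
Sitemap: https://metacpan.org/releases.xml.gz
Sitemap: https://metacpan.org/modules.xml.gz
```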
Actually, looks like I can just close it, so that's what I've done. :)