talexb (Contributor) commented Aug 19, 2013

This introduces a script called bin/generate_sitemap.pl that creates XML files
containing URLs for authors, releases and modules; a module that does all of
the heavy lifting; a test script that exercises the module; and an updated
robots.txt file.
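
For context, a sitemap file in the sitemaps.org format is a `urlset` element with one `loc` entry per page. The following is only a minimal sketch of writing one gzipped file from a list of URLs. The function names `xml_escape`, `sitemap_xml` and `write_sitemap` are hypothetical and are not taken from bin/generate_sitemap.pl, which may be structured quite differently.

```perl
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);

# Escape the five characters that are special in XML text and attributes.
sub xml_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    $s =~ s/'/&apos;/g;
    return $s;
}

# Build a sitemaps.org <urlset> document with one <loc> per URL.
sub sitemap_xml {
    my (@urls) = @_;
    my $xml = qq{<?xml version="1.0" encoding="UTF-8"?>\n}
            . qq{<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n};
    $xml .= '  <url><loc>' . xml_escape($_) . "</loc></url>\n" for @urls;
    $xml .= "</urlset>\n";
    return $xml;
}

# Gzip the XML into $path (hypothetical helper; the path is chosen by the caller).
sub write_sitemap {
    my ($path, @urls) = @_;
    my $xml = sitemap_xml(@urls);
    gzip(\$xml => $path) or die "gzip failed: $GzipError\n";
    return $path;
}

print sitemap_xml('https://metacpan.org/module/HTML::Restrict');
```

Calling `write_sitemap('authors.xml.gz', @author_urls)` would then produce one compressed file per object type, assuming `@author_urls` has already been collected from the search results.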

talexb added 3 commits August 19, 2013 12:32

Review comment (Contributor):

Please use FindBin so that we can call this script from any working directory.
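
A minimal sketch of what that could look like. The directory layout (modules under `lib/` next to `bin/`) and the `sitemaps` output directory are assumptions made for illustration, not details taken from this PR.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# FindBin sets $Bin to the directory containing the running script,
# independent of the caller's current working directory.
use FindBin qw($Bin);

# Assumed layout: bin/generate_sitemap.pl with project modules in ../lib.
use lib "$Bin/../lib";

# Hypothetical output directory, resolved relative to the script, not to cwd.
my $output_dir = "$Bin/../sitemaps";

print "Sitemaps would be written under $output_dir\n";
```

With paths anchored on `$Bin`, running the script as `perl bin/generate_sitemap.pl` from the project root or by absolute path from elsewhere should resolve the same locations.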

Reply (Contributor Author):

Done.

monken (Contributor) commented Aug 20, 2013

Great work! That gives us a good starting point. We need to think about what we want to include in the sitemap, though. Right now, the script includes all modules, including unauthorized, backpan, and not-latest ones. The same goes for releases.
In order to increase our relevance on Google, I would suggest we only put the latest modules and releases in the sitemap. This also means that updates to those modules will be detected faster by Google, since it doesn't have to refresh hundreds of thousands of pages.

talexb (Contributor Author) commented Aug 20, 2013

I'm happy to filter the search so that the sitemap is better; let me know how the search can be adjusted and I'll change it.

talexb added 3 commits August 20, 2013 15:14
Previous version dumped all releases; releases are now limited to just the latest ones, bringing that file in line with the size of the others.

monken (Contributor) commented Aug 28, 2013

Hi, I might be missing something, but why do we have an objectType for both distribution and release?
I figured we wanted to have:

  • module.xml.gz which includes all modules (e.g. search on the file type, reduced to authorized, indexed and latest modules).
  • release.xml.gz which includes all latest releases (search the release type for latest releases)
  • authors.xml.gz, the way it is right now

The code seems to generate a releases.xml.gz with the download_url. Is that intentional?
Also, the modules.xml.gz would generate links such as /module/DBIx-Class, which do not exist (it should be /module/DBIx::Class). So we are searching on the wrong type here.

I haven't actually run the script, so all of the above is what I understood from the source code.

talexb (Contributor Author) commented Aug 28, 2013

@monken I wasn't sure exactly what I should be picking out from the data structure, so I went with what made sense to me. I'll be glad to change it if I have it wrong.

oalders (Member) commented Sep 14, 2013

@talexb First off, thanks very much for all of the work you put into this. I apologize that I did not look at this sooner. Can you have a look at the changes I made to your branch here? https://github.com/CPAN-API/metacpan-web/tree/oalders/talexb/addSitemap

There's still one issue to fix -- the module links. Currently they look like:

https://metacpan.org/module/HTML-Restrict

They should be https://metacpan.org/module/HTML::Restrict

Having said that, you're creating the module links from distribution names, but they really should be created from module names. (See @monken's comment above). Here's an example of how I scroll through modules (and Pod):

https://github.com/oalders/iCPAN/blob/master/perl/lib/iCPAN.pm#L522

@monken can probably give you a better example, though.

One other issue is that the test should run against our test ES rather than on a subset of results from the production machine.

I removed a fair number of tests as they've mostly been made redundant via Moose and MooseX::StrictConstructor. If you have any questions about the conversion to Moose (or any other changes) feel free to hit me up. Probably commenting on the commits is a good way to manage that conversation.

As far as the download links are concerned, the issue is that the sitemap is for Google to easily find the pages which we want indexed. Download links won't help with SEO, so while they're helpful to have, they're not fixing the problem at hand, which is that Google doesn't like us very much. :)

Thanks again! As soon as we get the modules and the test sorted, I think we can merge this.

talexb (Contributor Author) commented Sep 15, 2013

Just having a look at the code -- seeing how you've re-engineered the code to use Moose is brilliant.

The test script has a small typo -- it should use MetaCPAN::Sitemap; after fixing that, I see the following URLs from the distribution search:

https://metacpan.org/module/-0.01
https://metacpan.org/module/0.0.3
https://metacpan.org/module/0.01
https://metacpan.org/module/0.05
https://metacpan.org/module/0.11
https://metacpan.org/module/0.21
https://metacpan.org/module/155
https://metacpan.org/module/20120109-NoSQL_and_MongoDB
https://metacpan.org/module/5.003_07-2.U
https://metacpan.org/module/5foldCV
https://metacpan.org/module/AA
https://metacpan.org/module/AAAA-Crypt-DH
https://metacpan.org/module/AAAA-Mail-SpamAssassin
https://metacpan.org/module/AAAAAAAAA
https://metacpan.org/module/AAC-Pvoice
https://metacpan.org/module/ABI
https://metacpan.org/module/ABNF-Grammar
https://metacpan.org/module/AC-DC
...

Some of these last ones (once I map '-' to '::') come up with a valid URL, but the ones with just version numbers produce a 'Not Found' message from MetaCPAN. Similarly, some of the ones at the end ('zex', 'zfilter') are not found.
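
To illustrate why mapping '-' to '::' only works some of the time, here is a small sketch. The helper names `guess_module_name` and `module_url` are hypothetical, and `libwww-perl` is an example chosen here rather than one from the list above: its distribution name does not correspond to any module name, so the guessed URL would not resolve.

```perl
use strict;
use warnings;

# Naive guess: derive a module name from a distribution name
# by replacing every '-' with '::'. This is a heuristic, not a lookup.
sub guess_module_name {
    my ($dist) = @_;
    (my $module = $dist) =~ s/-/::/g;
    return $module;
}

# Build a /module/ URL in the form requested earlier in this thread.
sub module_url {
    my ($module) = @_;
    return "https://metacpan.org/module/$module";
}

# The first two guesses match real module names; the third does not,
# because the libwww-perl distribution has no module called libwww::perl.
for my $dist (qw(HTML-Restrict DBIx-Class libwww-perl)) {
    printf "%-15s => %s\n", $dist, module_url(guess_module_name($dist));
}
```

This is why building the links from actual module names, as suggested above, is likely more reliable than rewriting distribution names.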

I have pushed my updates to my repo. Let's discuss this soon. Thanks!

ghost assigned oalders Dec 21, 2013

ranguard (Member) commented Feb 5, 2014

@oalders Please could you look at this again now that the big Bootstrap stuff is out of the way? It might help with the speed of the site if we can feed Google the lists instead of it crawling.

oalders (Member) commented Mar 15, 2014

@talexb Could you cancel this pull request? We've moved your commits over to #1126 and they will be merged very shortly. Thanks again for all the work you've put into this. It'll be great to have it in production.

oalders closed this Mar 15, 2014

oalders (Member) commented Mar 15, 2014

Actually, looks like I can just close it, so that's what I've done. :)
