Generate sitemap for metacpan.org website #924

talexb · 2013-08-19T16:37:21Z

This introduces a script called bin/generate_sitemap.pl that creates XML files containing URLs for authors, releases and modules; a module that does all of the heavy lifting; a test script that exercises the module; and an updated robots.txt file.
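For readers unfamiliar with the format, here is a minimal sketch of producing a gzipped sitemap from a list of URLs. It is not the code from this pull request: the function names (`xml_escape`, `sitemap_xml`, `gzipped_sitemap`) are made up for illustration, and it uses the core IO::Compress::Gzip module rather than the XML::Simple and PerlIO::gzip dependencies added in the commits.

```perl
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);

# Escape the characters that must not appear literally in XML text.
sub xml_escape {
    my ($text) = @_;
    my %entity = (
        '&' => '&amp;',  '<' => '&lt;', '>' => '&gt;',
        '"' => '&quot;', "'" => '&apos;',
    );
    $text =~ s/([&<>"'])/$entity{$1}/g;
    return $text;
}

# Build a sitemap document (sitemaps.org 0.9 schema) from a list of URLs.
sub sitemap_xml {
    my (@urls) = @_;
    my $xml = qq{<?xml version="1.0" encoding="UTF-8"?>\n}
            . qq{<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n};
    $xml .= '  <url><loc>' . xml_escape($_) . "</loc></url>\n" for @urls;
    $xml .= "</urlset>\n";
    return $xml;
}

# Compress the document in memory; a real script would write it to a file.
sub gzipped_sitemap {
    my (@urls) = @_;
    my $xml = sitemap_xml(@urls);
    my $gz;
    gzip( \$xml => \$gz ) or die "gzip failed: $GzipError";
    return $gz;
}

my $gz = gzipped_sitemap('https://metacpan.org/module/HTML::Restrict');
printf "compressed sitemap is %d bytes\n", length $gz;
```

Building the XML by hand keeps the sketch dependency-free; an XML library would be the safer choice for anything beyond a flat list of `<loc>` entries.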

monken · 2013-08-20T05:42:22Z

bin/generate_sitemap.pl

Please use FindBin, so we can call this script from any cwd
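A minimal sketch of what that could look like, assuming the script lives in bin/ and the modules in lib/ directly under the repository root (the layout is an assumption, not taken from the diff):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# FindBin sets $Bin to the directory containing the running script,
# regardless of the current working directory.
use FindBin qw($Bin);

# Assumed layout: bin/ and lib/ are siblings under the repository root.
use lib "$Bin/../lib";

print "script directory: $Bin\n";
print "library path:     $Bin/../lib\n";
```

Because `use FindBin` runs at compile time, `$Bin` is already set when `use lib` is evaluated, so the module path resolves relative to the script rather than to the caller's cwd.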

monken · 2013-08-20T05:46:06Z

Great work! That gives us a good starting point. We need to think about what we want to include in the sitemap, though. Right now, the script includes all modules, including unauthorized, BackPAN and non-latest ones. The same goes for releases.
In order to increase our relevance on Google, I would suggest we only put the latest modules and releases in the sitemap. This also means that updates to those modules will be detected faster by Google, since it doesn't have to refresh hundreds of thousands of pages.

talexb · 2013-08-20T12:23:10Z

I'm happy to filter the search so that the sitemap is better; let me know how the search can be adjusted and I'll change it.

Previous version dumped all releases; releases are now limited to just the latest ones, bringing that file in line with the size of the others.

monken · 2013-08-28T03:17:07Z

Hi, I might be missing something, but why do we have an objectType distribution and release?
I figured we wanted to have:

module.xml.gz which includes all modules (e.g. search on the file type, reduced to authorized, indexed and latest modules).
release.xml.gz which includes all latest releases (search the release type for latest releases)
authors.xml.gz, the way it is right now

The code seems to generate a releases.xml.gz with the download_url. Is that intentional?
Also, the modules.xml.gz would generate links such as /module/DBIx-Class, which do not exist (it should be /module/DBIx::Class). So we are searching on the wrong type here.
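To illustrate the idea, here is a rough sketch that filters records down to authorized, indexed and latest modules and builds the link from the module name rather than the distribution name. The records are hypothetical in-memory hashes, and the field names (`module`, `authorized`, `indexed`, `status`) and helper names are assumptions for illustration; the real script would get its data from a search on the file type.

```perl
use strict;
use warnings;

# Hypothetical records, shaped loosely like search results for files that
# contain a module. Field names are assumptions, not the real mapping.
my @files = (
    { module => 'DBIx::Class',    authorized => 1, indexed => 1, status => 'latest'  },
    { module => 'DBIx::Class',    authorized => 1, indexed => 1, status => 'backpan' },
    { module => 'HTML::Restrict', authorized => 1, indexed => 1, status => 'latest'  },
    { module => 'Some::Fork',     authorized => 0, indexed => 1, status => 'latest'  },
);

# Keep only the records that should appear in the sitemap.
sub sitemap_candidates {
    my (@records) = @_;
    return grep {
        $_->{authorized} && $_->{indexed} && $_->{status} eq 'latest'
    } @records;
}

# Build the link from the module name, which keeps the '::' separators.
sub module_url {
    my ($module) = @_;
    return "https://metacpan.org/module/$module";
}

my @urls = map { module_url( $_->{module} ) } sitemap_candidates(@files);
print "$_\n" for @urls;
```

With the sample data above, only the latest authorized entries for DBIx::Class and HTML::Restrict survive the filter.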

I haven't actually run the script, so all of the above is what I understood from the source code.

talexb · 2013-08-28T21:19:00Z

@monken I wasn't sure exactly what I should be picking out from the data structure, so I went with what made sense to me. I'll be glad to change it if I have it wrong.

oalders · 2013-09-14T03:40:19Z

@talexb First off, thanks very much for all of the work you put into this. I apologize that I did not look at this sooner. Can you have a look at the changes I made to your branch here? https://github.com/CPAN-API/metacpan-web/tree/oalders/talexb/addSitemap

There's still one issue to fix -- the module links. Currently they look like:

https://metacpan.org/module/HTML-Restrict

They should be https://metacpan.org/module/HTML::Restrict

Having said that, you're creating the module links from distribution names, but they really should be created from module names. (See @monken's comment above). Here's an example of how I scroll through modules (and Pod):

https://github.com/oalders/iCPAN/blob/master/perl/lib/iCPAN.pm#L522

@monken can probably give you a better example, though.

One other issue is that the test should run against our test ES rather than on a subset of results from the production machine.

I removed a fair number of tests as they've mostly been made redundant via Moose and MooseX::StrictConstructor. If you have any questions about the conversion to Moose (or any other changes) feel free to hit me up. Probably commenting on the commits is a good way to manage that conversation.

As far as the download links are concerned, the issue is that the sitemap is for Google to easily find the pages which we want indexed. Download links won't help with SEO, so while they're helpful to have, they're not fixing the problem at hand, which is that Google doesn't like us very much. :)

Thanks again! As soon as we get the modules and the test sorted, I think we can merge this.

talexb · 2013-09-15T19:55:44Z

Just having a look at the code -- seeing how you've re-engineered the code to use Moose is brilliant.

The test script has a small typo -- it should use MetaCPAN::Sitemap; after fixing that, I see the following URLs from the distribution search:

Some of these last ones (once I map '-' to '::') come up with a valid URL, but the ones with just version numbers produce a 'Not Found' message from MetaCPAN. Similarly, some of the ones at the end ('zex', 'zfilter') are not found.

I have pushed my updates to my repo. Let's discuss this soon. Thanks!

ranguard · 2014-02-05T12:58:51Z

@oalders Please could you look at this again now the big Bootstrap stuff is out of the way? It might help with the speed of the site if we can feed Google the lists instead of it crawling.

oalders · 2014-03-15T10:11:58Z

@talexb Could you cancel this pull request? We've moved your commits over to #1126 and they will be merged very shortly. Thanks again for all the work you've put into this. It'll be great to have it in production.

oalders · 2014-03-15T10:12:28Z

Actually, looks like I can just close it, so that's what I've done. :)

talexb added 3 commits August 19, 2013 12:32

Add XML::Simple to requirements

753e10c

Also add ElasticSearch

5bad125

monken reviewed Aug 20, 2013

Use FindBin to get lib path

3f07783

talexb added 3 commits August 20, 2013 15:14

Limit release to latest ones only

f2fd447

Create xml.gz files, not just xml files

ba01923

Add PerlIO::gzip to cpanfile

f2c55a6

oalders mentioned this pull request Aug 28, 2013

Create sitemap files for search engines #894

Closed

ghost assigned oalders Dec 21, 2013

oalders closed this Mar 15, 2014
