Making a sitemap (open source documentation wild goose chase)

Lonny's picture

This is my first blog entry on Appropedia. Eventually I will make more personal blog entries, but this seems like a good place to chronicle some of the MediaWiki fun we get to have. Hopefully it will help others as well. Let me know what you think.

So I wanted an Appropedia sitemap to submit to search engines. I am still unsure whether this is necessary, since we are indexed by all the major search engines and have a robots.txt to tell the well-behaved spiders where to go, but the available literature suggests that yes, we want a sitemap. Last year we implemented the MediaWiki (MW) extension Google Sitemap. It took a few hours of hacking, but eventually the extension made a sitemap every time you visited Special:GoogleSitemap. One limitation was that the sitemap was capped at 5000 entries, a cap we eventually passed. After we upgraded to a newer version of MW, the extension showed a much stronger limitation: it broke some things, e.g. rendering Special:Version as blank.

Well, it turns out MW has had a built-in sitemap generator since MW 1.6, but I could not figure out how to use it. It is amazing how little information is available about it. With the help of a few other sites and Curt, I finally got it working. It took a while because of that lack of information and two wrong assumptions on my part.

My first erroneous assumption was that we needed only one sitemap file, so I spent way too long trying to combine the many lines output by the MW sitemap script. That assumption is wrong, as described at this hard-to-find page at Google Support.

My second big error was forgetting to update the Apache rewrite rules while testing the new sitemap with Google. I kept getting an error, which I tried in vain to correct in the code of the sitemaps. So I added the following condition to .htaccess:
RewriteCond %{REQUEST_URI} !^/sitemap*
It exempts any request path beginning with /sitemap from the rewrite rule that follows it, so search engines can reach the sitemap index and files. And it worked!

The attached file (see the notes below for a breakdown of the code) is a simplification and adaptation that works for Appropedia, based on the great code at jrandomhacker.info and the great blog entry at dralspire.com. This adapted code relies on a few assumptions and changes, which may not work for you. I dropped the file into our root directory and run it from cron (or you can just log in to the server and run csh FileName).

Once you have the sitemap, you can go to the following sites to submit the sitemaps to search engines. You could write a script to automate this (see dralspire.com), but we will be updating the sitemap nightly and I don't want to ping the search engines that often. The search engines seem to update fairly often on their own.

http://www.google.com/webmasters/tools/ping?sitemap=http://www.appropedia.org/sitemap.xml
http://search.yahooapis.com/SiteExplorerService/V1/ping?sitemap=http://www.appropedia.org/sitemap.xml
http://submissions.ask.com/ping?sitemap=http://www.appropedia.org/sitemap.xml
http://api.moreover.com/ping?u=http://www.appropedia.org/sitemap.xml
http://webmaster.live.com/ping.aspx?siteMap=http://www.appropedia.org/sitemap.xml
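
If you did want to automate the pinging, a minimal sketch might look like the following. It only reuses the endpoints from the list above, which may have changed or been retired since this was written. The function names are made up for illustration, and the "send" mode assumes curl is installed.

```shell
#!/bin/sh
# Sketch only: the endpoints are copied from the list above and may no longer
# respond. build_ping_urls and ping_search_engines are illustrative names.

# Print one ping URL per line for the sitemap address given as $1.
build_ping_urls() {
    sitemap="$1"
    for endpoint in \
        "http://www.google.com/webmasters/tools/ping?sitemap=" \
        "http://search.yahooapis.com/SiteExplorerService/V1/ping?sitemap=" \
        "http://submissions.ask.com/ping?sitemap=" \
        "http://api.moreover.com/ping?u=" \
        "http://webmaster.live.com/ping.aspx?siteMap="
    do
        printf '%s%s\n' "$endpoint" "$sitemap"
    done
}

# Request each ping URL and print the HTTP status code next to it.
ping_search_engines() {
    build_ping_urls "$1" | while IFS= read -r url; do
        status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 30 "$url")
        echo "$status $url"
    done
}

# "list" prints the URLs without contacting anything; "send" actually pings.
case "${1:-}" in
    list) build_ping_urls "${2:-http://www.appropedia.org/sitemap.xml}" ;;
    send) ping_search_engines "${2:-http://www.appropedia.org/sitemap.xml}" ;;
esac
```

Running the script with list first is a cheap way to check the URLs before sending anything.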

Some notes:

Meanings of the code in the attached file:

  • The echo command just prints the results to the screen.
  • cd maintenance/ changes from the root to the maintenance directory where the MW sitemap script lives.
  • /usr/local/php5/bin/php generateSitemap.php runs the script using php5 (you may try just php generateSitemap.php).
  • mv -f sitemap* ../ moves all the generated files to the root directory.
  • cd .. changes back to the root directory.
  • sed 's/<loc>/<loc>http:\/\/www.appropedia.org\//g' sitemap-index-appropedia-w1.xml > sitemap.xml adds http://www.appropedia.org/ before each sitemap address in the sitemap index and saves the result as a new file, sitemap.xml. Replace www.appropedia.org with your own site name.
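
Put together, the steps above could be sketched as a plain sh script like the one below. This is not the attached file itself. The variable names and defaults are my own placeholders, and it assumes the index lists each sitemap file inside a <loc> tag, as the sitemap protocol specifies.

```shell
#!/bin/sh
# Sketch of the nightly sitemap job described in the notes above.
# Assumptions (not from the attached file): the variable names, their
# defaults, and that the index wraps each file name in a <loc> tag.

PHP_BIN="${PHP_BIN:-/usr/local/php5/bin/php}"   # or just "php"
BASE_URL="${BASE_URL:-http://www.appropedia.org/}"
INDEX_FILE="${INDEX_FILE:-sitemap-index-appropedia-w1.xml}"

# Prefix every <loc> entry in index file $1 with base URL $2; write to $3.
# "|" is the sed delimiter, so the slashes in the URL need no escaping.
prefix_index_urls() {
    sed "s|<loc>|<loc>$2|g" "$1" > "$3"
}

# Run from the wiki root: generate, move, and fix up the sitemap files.
build_sitemap() {
    echo "Generating sitemaps..."
    # The subshell keeps the directory change local to this one command.
    (cd maintenance/ && "$PHP_BIN" generateSitemap.php) || return 1
    mv -f maintenance/sitemap* ./ || return 1
    prefix_index_urls "$INDEX_FILE" "$BASE_URL" sitemap.xml
    echo "Wrote sitemap.xml"
}

# Nothing runs unless the script is called with "run".
if [ "${1:-}" = "run" ]; then
    build_sitemap
fi
```

Saved under a name of your choosing in the wiki root, it could be run nightly from cron with an entry that changes to that directory and calls the script with the run argument.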

To remove the old Google Sitemap extension, I did the following:

  • Removed the following line from LocalSettings.php: require_once("$IP/extensions/GoogleSitemap.php");
  • Deleted GoogleSitemap.php from the extensions/ directory.
  • Deleted SpecialGoogleSitemap.php from the includes/ directory.
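
For anyone who prefers to script the removal, here is a rough sketch of the same three steps. The function name and the .bak backup are my own additions, so treat it as a starting point and check the result before deleting the backup.

```shell
#!/bin/sh
# Sketch: remove the old Google Sitemap extension from a wiki root directory.
# remove_google_sitemap and the .bak backup are illustrative, not from the post.

remove_google_sitemap() {
    wiki_root="$1"
    settings="$wiki_root/LocalSettings.php"

    # Keep a backup, then rewrite the file without the require_once line.
    cp "$settings" "$settings.bak" || return 1
    grep -vF 'extensions/GoogleSitemap.php' "$settings.bak" > "$settings"

    # Delete the two extension files (-f: no error if they are already gone).
    rm -f "$wiki_root/extensions/GoogleSitemap.php" \
          "$wiki_root/includes/SpecialGoogleSitemap.php"
}

# Nothing runs unless the script is called as: script remove /path/to/wiki
if [ "${1:-}" = "remove" ] && [ -n "${2:-}" ]; then
    remove_google_sitemap "$2"
fi
```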

Once these forums and blogs are out of beta, we can add a second sitemap to the root directory. I know Google will accept that. Or we can add a new sitemap file to the index (sitemap.xml), which I think will be the best plan.

Comments

hey o.O that's weird -

hey o.O that's weird - what's with the "inline:fixsitemap.txt=attached" URL of the attached file... FF doesn't know how to handle "inline:"

Lonny's picture

Inline error

Hi. Thanks for the note. I do not know what is wrong with that code, but I am sure Curt or Chris can fix it. For now, I got rid of the offending line.

Lonny's picture

Still problems

Looks like the script is only fixing the main sitemap.xml and not all the namespace ones. Look for a way to fix that!!!

Thanks for the reference

Hey, thanks for the reference. I'm going to be revisiting this topic soon, and I'll come back and let you know how things have gone.

Lonny's picture

Thanks for the code

I look forward to hearing what comes of your revisit. Thank you for sharing so much of your work at http://jrandomhacker.info/.

Updates for 1.12.0+?

It looks like things broke somewhere along the line. Have you had a chance to re-test things since these various MediaWiki and Google upgrades?

Something somewhere seems to be broken on my system. Are things working for you?

Lonny's picture

nope

I have not had a chance to retest... except that it seems more broken than before. I am sure that there is a simple fix, but I do not know what it is. It also seems that most of the search engines are finding new Appropedia pages quickly, so this has become a lower (but still a) priority for us. Please let us know if you figure anything out, and I will post once we get to it as well.

Thank you.
