Open content search (help please!)

Chriswaterguy's picture

I want to create a custom search engine that searches as many open content sites as possible. I want something that finds material that can be used freely - for Appropedia, Wikipedia, or any purpose you might want. This means only those sites which allow material to be used and modified, not only accessed. The list of sites used in the search engine will also be made public, in the spirit of open source.

This means only content which is public domain, GFDL, CC-by-sa, or something very similar. (As for any non-commercial license, such as CC-sa-nc, or any site that is merely open access... that's nice, but not good enough for what we want here.)
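A minimal sketch of that rule, for anyone vetting candidate sites by hand. The accepted set and the label spellings are assumptions taken from the paragraph above, not an official list, and `is_reusable` is a hypothetical helper name:

```python
# Licence labels treated as reusable; this set is an assumption based on
# the criteria stated in the post (public domain, GFDL, CC-by-sa).
ACCEPTED = {"public domain", "gfdl", "cc-by-sa"}


def is_reusable(license_label: str) -> bool:
    """Return True if the label is one of the accepted open-content licences."""
    label = license_label.strip().lower()
    # Any non-commercial (nc) or no-derivatives (nd) clause rules a licence out.
    parts = label.split("-")
    if "nc" in parts or "nd" in parts:
        return False
    return label in ACCEPTED


print(is_reusable("CC-by-sa"))  # accepted
print(is_reusable("CC-sa-nc"))  # rejected: non-commercial
```

"Something very similar" still needs a human to read the license text; the check only covers exact labels.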

(EDIT: I've just realized, no CC licenses are suitable for using in a GFDL resource like Appropedia. I had misinterpreted the "or a compatible license". Apparently there are differences in the underlying legal text. The two groups are working on making the licenses compatible - let's hope they do so quickly!)

Now, when I go searching for listings of public domain sites, I find lots of lists, but the sites given are usually *not* public domain as claimed but mainly just open access. So my first question is: does anyone have a list of actual public domain and/or other open content?

If not... we have to start from scratch. So... what sites can we add? Here are the ones I've listed so far at Appropedia open content search:

  • .gov
  • .mil
  • .appropedia.org
  • .wikipedia.org
  • .wikibooks.org
  • .wikiversity.org
  • .gutenberg.org
  • .wikisource.org

www.ethanzuckerman.com/blog/* (CC license)

(Now, the .gov sites, in particular, are a mixed bag. Lots of public domain material, but unless the content was actually created by a federal government employee, it will probably be under copyright. A more sophisticated search of US government sites will improve things: either list every single federal government site, or exclude all state government sites. I think the second one is the better option - just list all the states' domains: -*alabama.gov, -*arkansas.gov, etc. There's still the issue of contractors and government supported laboratories etc who publish content on federal sites, but that will just have to be watched out for.)
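The include-then-exclude idea can be sketched roughly as below. This is only an illustration of the logic, not how the search engine itself works; `is_included` is a hypothetical helper, and the domain lists passed to it are examples. Note that it matches on dot boundaries, which is slightly stricter than a pattern like -*alabama.gov:

```python
def is_included(host: str, include: list[str], exclude: list[str]) -> bool:
    """Return True if host falls under an included domain and no excluded one."""
    host = host.lower().rstrip(".")

    def matches(suffix: str) -> bool:
        # Treat ".gov" and "gov" the same, and match only whole domain labels.
        suffix = suffix.lower().lstrip(".")
        return host == suffix or host.endswith("." + suffix)

    # Exclusions win over inclusions, so a state site under .gov is dropped.
    if any(matches(s) for s in exclude):
        return False
    return any(matches(s) for s in include)


# Example lists; a real exclusion list would need every state's domain.
include = [".gov", ".mil"]
exclude = ["alabama.gov", "arkansas.gov"]
print(is_included("www.nasa.gov", include, exclude))
print(is_included("www.alabama.gov", include, exclude))
```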

Chris

curtbeckmann's picture

Does fao.org fit the profile?

CurtB, Appropedia guy

Chriswaterguy's picture

No

Sadly no - it's NC, i.e. they specify non-commercial use, so the material can't be reused under a license that is not NC.

I don't think they explicitly allow derivatives either. From memory, a lot of (all?) UN stuff is like this.

Chris

Chriswaterguy's picture

Open content blogs

Thankfully there are some bloggers who use open content licenses - but not many. So all this great content is being created on some of the best blogs, and it cannot easily be reused.

Ethan Zuckerman is already listed. Doing a search brings up more blogs *about* open licenses than *under* open licenses, but I did find Oilgae.com under GFDL. I added "sustainability" to the search - no point getting a GFDL blog about Britney Spears :D. Didn't look much further as my internet went from slow to dead last night.

The ideal would be to have a list of all blogs under such licenses, and if we found that the ones irrelevant to our purposes were giving false positive search results (popping up when they weren't actually useful) we could think about splitting the list into sustainability, international development, and other.

I fear all this may be reinventing the wheel though. Surely someone else has put such lists together?

Chris

Google usage rights limiter in advanced search

Hey all,
I'm working with the Appropriate Technology Search CSE. There is a way to refine the results by license, but I'm having trouble implementing it because I don't know the "freehand" form of the advanced search syntax... it should be something like:
water and usage rights: cc
etc.
But I can't find a reference for it... still experimenting.
http://www.google.com/coop/cse?cx=009941892632664145530%3Aglw3czclbti
http://tinyurl.com/yste98

Refinement example here:
http://www.google.com/custom?hl=en&client=google-coop-np&cof=FORID:1%3BA...

search syntax example:
water more:open_content_results

obviously not correct!!
More as it comes,
W

Chriswaterguy's picture

That's a great feature! It only finds CC content, but until GFDL and PD have similar tags, that's unavoidable.

I went looking for places to ask help on Google search - this Google Group looks like a good place to start. I also want to ask how to exclude particular subdomains (e.g. *.alabama.gov) when searching a domain (*.gov). Busy right now but will check later, if no one else has done it.

Lonny's picture

I used this in the google search box and it seemed to work well:
*.gov -site:.nasa.gov

I tested it with: *.gov -site:.nasa.gov nasa, and did not find any NASA sites.
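For more than a handful of exclusions, the query string could be assembled rather than typed. A rough sketch, assuming the same query form as above; `build_query` is a hypothetical helper, and Google may limit how long a query can be:

```python
def build_query(terms: str, domain: str, excluded: list[str]) -> str:
    """Build a query of the form: *.gov -site:.nasa.gov <terms>."""
    parts = [f"*.{domain}"]
    # One -site: operator per excluded domain, in the form used above.
    parts += [f"-site:.{d}" for d in excluded]
    if terms:
        parts.append(terms)
    return " ".join(parts)


print(build_query("nasa", "gov", ["nasa.gov"]))
# prints: *.gov -site:.nasa.gov nasa
```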

Chriswaterguy's picture

Yes, that works, and will be useful. But in this particular case, I'm not sure about putting in hundreds of state and local government sites that way.*

 

Turns out there is another way to exclude subdomains, which I'd overlooked - an answer to my question on Google groups said:

    Go back to the "Sites" tab of the CSE control panel.
    Beneath your list of included sites, notice the "Excluded sites" section.
    Click on "Exclude sites", and specify: alabama.gov...
 
* 50-odd states & territories are not so bad, but I just learnt that US local government sites also have their own subdomains, rather than being in the state subdomain, making it a bigger challenge than I thought - gotta find a complete list...

- - - -
Chriswaterguy

Chriswaterguy's picture

I've asked already

I got impatient :D

   * Usage rights (CC licenses)
   * searching a domain, but excluding certain subdomains

Hoping there are answers...

- - - -
Whatever you can do, or dream you can, begin it! Boldness has genius, power, and magic in it.
–Goethe

curtbeckmann's picture

I believe that if you attempt a search using the Google Advanced page, when you get to the results window, the decorated "freehand" version of your search will appear in the search box at the top of the results page. So, if you know how to initiate the search using the advanced search window, it should tell you how to do the equivalent search in freehand. No?

CurtB, Appropedia guy

Chriswaterguy's picture

No, not for this search - it gives radio buttons instead.
- - - -
Chriswaterguy

Chriswaterguy's picture

More ways of looking for CC content are listed here. This ability to search is a *very* good feature of CC licenses.

Chris
