
I want to create a custom search engine that searches as many open content sites as possible. I want something that finds material that can be used freely - for Appropedia, Wikipedia, or any purpose you might want. This means only those sites which allow material to be used and modified, not only accessed. The list of sites used in the search engine will also be made public, in the spirit of open source.
This means only content which is: Public domain, GFDL, or CC-by-sa, or something very similar. (As for any non-commercial license, such as CC-by-nc-sa, or any open access site... that's nice, but not good enough for what we want here.)
(EDIT: I've just realized, no CC licenses are suitable for using in a GFDL resource like Appropedia. I had misinterpreted the "or a compatible license".
Apparently there are differences in the underlying legal text. The two groups are working on making the licenses compatible - let's hope they do so quickly!)
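To make the rule above concrete, here is a minimal sketch of the filter in Python. The license labels and the function name `is_reusable` are made up for illustration; real sites describe their licenses in many different ways, so treat this as a statement of the rule rather than a working classifier.

```python
# Hypothetical license labels, chosen for this sketch only.
FREE_LICENSES = {"public-domain", "gfdl", "cc-by-sa"}


def is_reusable(license_label, target="any"):
    """Return True if content under license_label fits the rule above.

    target="gfdl" applies the stricter rule from the EDIT note:
    no CC license can go into a GFDL resource such as Appropedia.
    """
    label = license_label.strip().lower()
    if label not in FREE_LICENSES:
        # Non-commercial and merely open-access material is rejected here.
        return False
    if target == "gfdl" and label.startswith("cc-"):
        return False
    return True


print(is_reusable("GFDL"))                     # free license, any target
print(is_reusable("cc-by-nc-sa"))              # non-commercial, rejected
print(is_reusable("CC-by-sa", target="gfdl"))  # CC into GFDL, rejected
```

Keeping the GFDL restriction as a separate `target` argument means the same list can still serve people who only need "free for any purpose".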
Now, when I go searching for listings of public domain sites, I find lots of lists, but the sites given are usually *not* public domain as claimed but mainly just open access. So my first question is: does anyone have a list of actual public domain and/or other open content?
If not... we have to start from scratch. So... what sites can we add? Here are the ones I've listed so far at Appropedia open content search:
www.ethanzuckerman.com/blog/* (CC license)
(Now, the .gov sites, in particular, are a mixed bag. Lots of public domain material, but unless the content was actually created by a federal government employee, it will probably be under copyright. A more sophisticated search of US government sites will improve things: either list every single federal government site, or exclude all state government sites. I think the second one is the better option - just list all the states' domains: -*alabama.gov, -*arkansas.gov, etc. There's still the issue of contractors and government supported laboratories etc who publish content on federal sites, but that will just have to be watched out for.)
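The "exclude the states" idea can be sketched as a small Python helper that builds the query text. This assumes the ordinary Google search-box operators `site:` and `-site:` rather than the custom search engine's own site list, and the function name and the two-entry domain list are only illustrative.

```python
def build_gov_query(terms, excluded_domains):
    """Build a search-box query over .gov that skips the listed domains."""
    parts = ["site:.gov"]
    # One -site: operator per domain to leave out.
    parts += [f"-site:{domain}" for domain in excluded_domains]
    if terms:
        parts.append(terms)
    return " ".join(parts)


# Illustrative and incomplete: a real list would cover every state.
STATE_DOMAINS = ["alabama.gov", "arkansas.gov"]

print(build_gov_query("water", STATE_DOMAINS))
# prints: site:.gov -site:alabama.gov -site:arkansas.gov water
```

With fifty or more domains the query gets long, so this is probably more useful for spot checks than as the search engine's real configuration.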
Chris
Does fao.org fit the profile?
CurtB, Appropedia guy
No
Sadly no - it's NC. i.e. they specify non-commercial use, so it can't be used under a license which is not NC.
I don't think they explicitly allow derivatives either. From memory a lot of (all?) UN stuff is like this.
Chris
Open content blogs
Thankfully there are some bloggers who use open content licenses - but not many. So all this great content is being created on some of the best blogs, and is not able to be easily reused.
Ethan Zuckerman is already listed. Doing a search brings up more blogs *about* open licenses than *under* open licenses, but I did find Oilgae.com under GFDL. I added "sustainability" to the search - no point getting a GFDL blog about Britney Spears :D. Didn't look much further as my internet went from slow to dead last night.
The ideal would be to have a list of all blogs under such licenses, and if we found that the ones irrelevant to our purposes were giving false positive search results (popping up when they weren't actually useful) we could think about splitting the list into sustainability, international development, and other.
I fear all this may be reinventing the wheel though. Surely someone else has put such lists together?
Chris
Google usage rights limiter in advanced search
Hey all,
Working with the Appropriate Technology Search CSE, I found there is a way to refine the results to filter by license, but I'm having trouble implementing it because I don't know the "freehand" form of the advanced search syntax... it should be something like:
water and usage rights: cc
etc.
But I can't find a reference for it... still experimenting.
http://www.google.com/coop/cse?cx=009941892632664145530%3Aglw3czclbti
http://tinyurl.com/yste98
Refinement example here:
http://www.google.com/custom?hl=en&client=google-coop-np&cof=FORID:1%3BA...
search syntax example:
water more:open_content_results
obviously not correct!!
More as it comes,
W
That's a great feature! It only finds CC content, but until GFDL and PD have similar tags, that's unavoidable.
I went looking for places to ask for help on Google search - this Google Group looks like a good place to start. I also want to ask how to exclude particular subdomains (e.g. *.alabama.gov) when searching a domain (*.gov). Busy right now but will check later, if no one else has done it.
I used this in the google search box and it seemed to work well:
*.gov -site:.nasa.gov
I tested it with:
*.gov -site:.nasa.gov nasa
and did not find any NASA sites.
Yes, that works, and will be useful. But in this particular case, I'm not sure about putting in hundreds of state and local government sites that way.*
Turns out there is another way to exclude subdomains, which I'd overlooked - an answer to my question on Google groups said:
Go back to the "Sites" tab of the CSE control panel. Beneath your list of included sites, notice the "Excluded sites" section. Click on "Exclude sites", and specify: alabama.gov...
- - - -
Chriswaterguy
I've asked already
I got impatient :D
Hoping there are answers...
- - - -
Whatever you can do, or dream you can, begin it! Boldness has genius, power, and magic in it.
–Goethe
I believe that if you attempt a search using the Google Advanced page, when you get to the results window, the decorated "freehand" version of your search will appear in the search box at the top of the results page. So, if you know how to initiate the search using the advanced search window, it should tell you how to do the equivalent search in freehand. No?
CurtB, Appropedia guy
No, not for this search - it gives radio buttons instead.
- - - -
Chriswaterguy
More ways of looking for CC content are listed here. This ability to search is a *very* good feature of CC licenses.
Chris