Remaking Google Uncle Sam (Sort Of) With Mojeek and a CISA Dataset

Do you remember Google Uncle Sam? I can’t blame you if you don’t; Google killed it off in 2011. It was a specialty search offering of Google’s that restricted its results to .gov sites. Very useful! There was a lot of grumbling after it was shut down; in fact, I made a replacement in 2012 that’s still available now. (I used Google Custom Search and URL patterns.)

As I map out how I want my replacement for Google Web Alerts to work, I keep thinking about things I want to monitor, like parts of the .gov Web space. And of course the best way to figure out how to monitor a Web space is figuring out how to SEARCH a Web space. So I made a Mojeek-and-CISA-powered .gov space search tool.

The Google Uncle Sam replacement I made in 2012 relied on consistent domain patterns used in government sites: things like http://www.xx.co.us denoting a county Web site and http://www.xx.ci.us denoting a city Web site. That means users can restrict results to city or county sites, but the search only finds sites that actually use the URL patterns I’m searching for. If I wanted to make sure I was including all available gov sites, I would need an official list.

Well, lookie here! CISA has an official list of gov sites on GitHub. It’s a basic CSV file so it’s easy for a JavaScript program to grab and start filtering.

A screenshot of a CSV file showing the data headers (Domain name, Domain type, agency, and state) and some sample data beneath from Alaska.
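
Here’s a minimal sketch of what that grab-and-filter step might look like in JavaScript. It is not the tool’s actual code. The column names (“Domain name”, “Domain type”, “State”) are assumptions taken from the headers visible in the screenshot, the sample rows are made up, and the function names are hypothetical. In a real page the CSV text would come from fetching the CISA file rather than from a hard-coded string.

```javascript
// Made-up sample rows in the shape the screenshot suggests (not real CISA data).
const SAMPLE_CSV = [
  'Domain name,Domain type,Agency,State',
  'EXAMPLE-CITY-ONE.GOV,City,"City of Example One, Public Works",NC',
  'example-city-two.gov,City,City of Example Two,NC',
  'example-county.gov,County,Example County,NC',
  'example-tribe.gov,Tribal,Example Tribe,CA',
].join('\n');

// Split one CSV line into fields, honoring double-quoted fields that contain commas.
// Assumption: no field spans multiple lines.
function parseCsvLine(line) {
  const fields = [];
  let current = '';
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === '"' && line[i + 1] === '"') {
        current += '"'; // escaped quote inside a quoted field
        i++;
      } else if (ch === '"') {
        inQuotes = false;
      } else {
        current += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ',') {
      fields.push(current);
      current = '';
    } else {
      current += ch;
    }
  }
  fields.push(current);
  return fields;
}

// Turn CSV text into an array of objects keyed by lowercased header name.
function parseCsv(text) {
  const lines = text.split(/\r?\n/).filter((l) => l.trim() !== '');
  const headers = parseCsvLine(lines[0]).map((h) => h.trim().toLowerCase());
  return lines.slice(1).map((line) => {
    const values = parseCsvLine(line);
    const row = {};
    headers.forEach((h, i) => {
      row[h] = (values[i] || '').trim();
    });
    return row;
  });
}

// Return lowercased domain names matching an optional domain type and state.
function filterDomains(rows, { type, state } = {}) {
  return rows
    .filter((r) => !type || (r['domain type'] || '').toLowerCase() === type.toLowerCase())
    .filter((r) => !state || (r['state'] || '').toLowerCase() === state.toLowerCase())
    .map((r) => (r['domain name'] || '').toLowerCase());
}
```

With the sample above, `filterDomains(parseCsv(SAMPLE_CSV), { type: 'City', state: 'NC' })` picks out the two made-up North Carolina city domains and skips the county and tribal rows.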

The dataset includes a “domain type” field (city, county, state, etc.), which gives you a few options. Want to search all the city .gov Web spaces in North Carolina for landslide? No problem. (I’m using Mojeek’s lowest API tier, which provides only 10 results at a time and has strict rate limits. That’s why the results display in sets: I can only search 25 domains at a time, and it’s less boring to watch the results populate the page one API call at a time than to wait through the rate limits and show all the results at once.)

A screenshot of a .gov Web space search run through Mojeek. The search is filtering for North Carolina city sites, and shows results for the query landslide.
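
The batching described above could be sketched roughly like this. Again, this is an illustration under assumptions, not the tool’s real code: the 25-domain batch size comes from the post, but the one-second pause is a placeholder, and `searchBatch` and `onResults` are hypothetical callbacks standing in for the actual Mojeek API call and the page-rendering step.

```javascript
// Split a list into consecutive groups of at most `size` items.
function chunk(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Promise-based pause used to space out API calls.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Search the domains one batch at a time, handing each batch's results to
// onResults as soon as they arrive so the page can fill in progressively.
// searchBatch(query, domains) is a hypothetical stand-in for the real API call.
async function searchInBatches(domains, query, searchBatch, onResults, options = {}) {
  const { batchSize = 25, delayMs = 1000 } = options; // delayMs is an assumed value
  const batches = chunk(domains, batchSize);
  for (let i = 0; i < batches.length; i++) {
    const results = await searchBatch(query, batches[i]);
    onResults(results, i, batches.length);
    // Pause between calls, but not after the last one.
    if (i < batches.length - 1) {
      await sleep(delayMs);
    }
  }
  return batches.length;
}
```

Running the batches sequentially, instead of firing them all at once, keeps the calls inside the rate limit and gives the reader something to look at after the first call returns.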

You can mix options to do something like search all the tribal gov Web spaces in California:

A screenshot showing Mojeek/gov search results for Tribal sites in California. The Mojeek query is intitle:water.

The more I use Mojeek’s API the happier I am with the focus parameter. The increasing amount of infosewage online means that defining tight search spaces is only going to get more important. I think my next step is going to be reviewing the College Data Scorecard and making some sets of university Web sites. Being able to monitor a search as general as “new database” or “digital archive” and restrict it to a collection of research universities? Yes please!
