Open Data is a powerful tool that can be used for business assessment, predictive analytics, marketing, or just exploration of the world around us. In Part One we laid out the case for Open Data and its usefulness in the pursuit of both Public Good and business benefit. Part Two focuses on proper licensing and the use of Open Data “The Right Way” as well as the benefits of open government data.
What Data am I Allowed to Use?
Several years ago, I spent a month in India setting up an analytics centre for a major US joint venture. With not much to do at night (you can only practice sitar so long before you get thrown out – ask my wife!), my mates and I in the shared apartment turned to app coding and data exploration.
This was right around the time of the WikiLeaks scandal, and Julian Assange was promoting a massive set of hacked US military communiques as Public (has-the-right-to-know) Data. Foolish and dangerous to say the least. A young housemate (from a different company, mind you) wanted some help loading this information into a database for analysis.
“It’s Open Data, right? I found it on the Internet.” This kid needed a quick redirection, so we shut his computer down and took a walk to the streetside chaiwala for some tea and bhaji. I briefly explained the differences between the many kinds of Internet data, and how you can and should use each. I think he got the point.
Nearly all information you find online is public, that’s true. However, that data is typically protected by Terms of Use or licenses that limit how that data can be legally used. Remember that click-through legal language you said “yes” to when you signed up for that new social website?
The Terms language defines what the company, and what the public, can do with the data you submit to that site. In most cases the Terms are quite restrictive and somewhat favorable to you as the end consumer. Third parties are not typically allowed to take that information in its full form and re-use it. Either you, or the site, or sometimes both parties, own the copyright to that information.
This is not Open Data. In our shop, we call it Restrictive Terms of Use Data.
General License and Terms Types
For data that comes from government sources or public research departments, you will most often find an actual license that governs what can be done with it. This should be contained within the downloaded file, or possibly as a main link on the Open Data Portal you’re viewing. Most non-government sites will instead offer a Terms of Use agreement, or even require you to click through one.
Here are a few common license and terms types we see and what you can do with them. NOTE: this information is not in any way legal advice – you or your company are ultimately responsible for your actions with data you locate and use.
If you’re considering a commercial endeavor using this data, you’d do best to retain a skilled IP lawyer who can walk you through these concepts.
1. Open Data: most often allows full use of the data, including for commercial use. See Canada’s Open Government Licence for an example.
2. Creative Commons: a flexible license that allows the author to determine what may be done with the information. Most Creative Commons License works do allow for commercial use. Nearly all require attribution of the original author in derivative works.
3. Custom Attribution Terms: the terms vary with these purpose-built agreements, but the general theme is that the work may be used for either commercial or non-commercial purposes, provided the original author is attributed.
4. Restrictive Terms of Use: This information is nearly always found in the footer of a website. READ THESE TERMS prior to even downloading data. These terms vary widely, based both on the intent of the company and at times the forward-thinking nature of their legal counsel. Typically these terms restrict copying and commercial use.
5. API Terms of Service: an emerging trend is for Internet companies to offer their data via API, for a reasonable per-record fee. This may be a mix of public and private data. As long as you follow the rules in the Terms of Service related to reselling, display of ads, and copyright, you can get a lot of information. Do note that most of the major API players (Google, Bing, etc.) specifically prohibit the caching of the data pulled from the API, which means you probably can’t set up a custom search engine and put the results into a database. Review those terms carefully before building a system dependent on these seemingly public sources.
6. No Terms of Use: This is a tricky one. The website owner missed the step of defining the rules for content re-use. Use extreme caution when considering the use of this data – someone out there created this information and owns the right to determine its lawful use. Maybe the site owner doesn’t care about the copyright of their own data, or doesn’t have a legal framework in place. In either of these cases, do you really trust the data itself? Maybe they are hiding the fact that they don’t own the copyright. Do you trust your ability to use this information without legal risk?
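For readers who like to see a checklist as code, the six categories above can be sketched as a rough first-pass triage. This is only our simplification of the list: the names TermsType, Triage, and triage are hypothetical, the rules are the "typical" cases described above rather than the text of any real license, and the actual terms always win over the sketch.

```python
from dataclasses import dataclass
from enum import Enum


class TermsType(Enum):
    OPEN_DATA = "Open Data"
    CREATIVE_COMMONS = "Creative Commons"
    CUSTOM_ATTRIBUTION = "Custom Attribution Terms"
    RESTRICTIVE_TERMS = "Restrictive Terms of Use"
    API_TERMS = "API Terms of Service"
    NO_TERMS = "No Terms of Use"


@dataclass(frozen=True)
class Triage:
    proceed: bool            # False means "stop and get advice before using the data"
    notes: tuple[str, ...]   # reminders drawn from the category descriptions


def triage(terms: TermsType, commercial: bool, will_cache: bool = False) -> Triage:
    """First-pass screen only; it never replaces reading the actual terms."""
    notes = []
    proceed = True
    if terms is TermsType.OPEN_DATA:
        notes.append("Usually allows full use; still read the license itself.")
    elif terms is TermsType.CREATIVE_COMMONS:
        notes.append("Check which variant applies; attribution is nearly always required.")
        if commercial:
            notes.append("Confirm this variant permits commercial use.")
    elif terms is TermsType.CUSTOM_ATTRIBUTION:
        notes.append("Attribute the original author exactly as the terms describe.")
    elif terms is TermsType.RESTRICTIVE_TERMS:
        notes.append("Copying and commercial use are typically restricted.")
        proceed = False
    elif terms is TermsType.API_TERMS:
        notes.append("Follow the rules on reselling, display of ads, and copyright.")
        if will_cache:
            notes.append("Major API providers typically prohibit caching results.")
            proceed = False
    else:  # TermsType.NO_TERMS
        notes.append("Someone still owns the rights; provenance is unknown.")
        proceed = False
    return Triage(proceed, tuple(notes))


if __name__ == "__main__":
    result = triage(TermsType.API_TERMS, commercial=True, will_cache=True)
    print(result.proceed)
    for note in result.notes:
        print("-", note)
```

A result with proceed set to False does not mean the data is off limits. It means this is the point to stop, read the terms closely, and ask for permission or advice before going further.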
Mr. Assange’s (or Chelsea Manning’s, for that matter) miserable stunt of gathering classified material and publishing it doesn’t even bear discussion for licensing. It’s illegal, and extremely harmful. It also serves as a big setback to those of us in the community who are advocating for large-scale, ethical use of government data. Bad actors, whether they are into espionage or are simply Content Trolls with web scraping tools trying to make a buck, do a major disservice to the Open Data movement, and that eventually harms the public by slowing the release of new data into the public domain.
But I’m Just Doing Research, and Not Selling Anything. Why should I care?
Again, no legal advice here. What we have learned in our own research is that Terms and Licenses will typically spell out every use possible for the data you’re looking at. You should also familiarize yourself with the very important concepts of Fair Use (US) and Fair Dealing (Canada and other Commonwealth countries) that define what you can and cannot do with someone else’s data, even for non-commercial purposes.
Typically these rules boil down to how much data you can gather, what defines a derivative work, and at what point you are misusing someone else’s work. In the end, if it’s not specifically Open Data, you are running a risk of violating someone’s rights by using their data without permission.
We see this quite often in the student research community: “I pulled 100,000 stock quotes and ran a regression analysis to pick a better market strategy” is a common theme in many Master’s-level statistics papers.
The questions to ask yourself: Did you review the Terms? Are there limitations on the use of the data? Did you follow the correct attribution approach required by the copyright holder?
If you can answer these questions, you’ve done your data homework. If you never even thought to check, then you’re not showing the diligence required for true research work.
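One way to make that homework visible is to keep a small provenance record next to every dataset you download. The sketch below is a minimal illustration, not a standard: the class name DatasetProvenance, its fields, and the example.org URL are all hypothetical, and the three checks simply mirror the three questions above.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class DatasetProvenance:
    name: str
    source_url: str
    terms_reviewed_on: Optional[date] = None      # None = nobody has read the terms yet
    limitations_checked: bool = False             # have usage limits been looked for?
    limitations: list = field(default_factory=list)
    attribution_required: Optional[bool] = None   # None = not yet determined
    attribution_text: Optional[str] = None        # the credit line you will publish

    def open_questions(self) -> list:
        """Return the homework questions that are still unanswered."""
        questions = []
        if self.terms_reviewed_on is None:
            questions.append("Did you review the Terms?")
        if not self.limitations_checked:
            questions.append("Are there limitations on the use of the data?")
        if self.attribution_required is None or (
            self.attribution_required and not self.attribution_text
        ):
            questions.append("Did you follow the required attribution approach?")
        return questions


if __name__ == "__main__":
    # Hypothetical dataset, for illustration only.
    quotes = DatasetProvenance(
        name="stock quotes",
        source_url="https://example.org/quotes",
    )
    for question in quotes.open_questions():
        print("TODO:", question)
```

An empty list from open_questions() only tells you that the questions were asked and answered. Whether the answers actually permit your use is still for you, or your lawyer, to decide.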
Setting aside “The Right Way” argument for a moment, there’s a very practical reason to be diligent with permissions, even for non-commercial work. Just because you’re doing internal research or a school report today doesn’t mean that you won’t have another use for this work down the road.
A new product idea could emerge, and you might want to patent it or bring it to market. What will you do when your company or new investor finds out you’ve based your product on content you don’t own? Valuation down the drain, and your product is cancelled.
That research report might turn into a larger work that you want to submit to a journal. Do you really want your references to point to a website that on inspection has unclear data provenance, or worse yet shows that you violated copyright in the creation of your work? Kiss that fellowship goodbye…
If it seems like a tough stance, it probably is. But using this approach will ensure you’re working with the right kind of data, from an authorized source, with the correct permission. You’ll be able to look anyone in the eye and say, “I obtained my data The Right Way”.
If you’re a business wanting to use Open Data and working with a firm, or a site, that can’t tell you where each piece of data came from, you’re at risk. Similar to the risk of drinking an hours-old Mango Lassi from that street vendor with no refrigeration – but with a stomach-ache that lasts MUCH longer.
But What if my Government’s Not Playing The Open Data Game?
Sad to say: Illinois is one of the greatest states in the US, but also one of the least open in terms of public data from the state government. Leave aside the juicy stories about The Shackled Gov(s), “Da Mare”, and a long history of political corruption for a moment.
The State of Open Palms and Closed Data (OK, couldn’t resist!) is one of only a few in North America with a statewide law that allows business data sharing with the public – but only for a fee, and often with significant restrictions on reuse. Want to find another fee-based region, here up north? Our favored Province of Ontario.
So what do you do when you need to gather this information, but can’t? Build relationships – data relationships. If you do the research to get this information from other Open Data sources (they are out there), and spend solid time cleaning and linking the records, you’ll find what you’re looking for.
This is how we’ve been able to reveal the real names behind so many Numbered Corporations here in Canada. Look for Federal, Municipal, and cross-provincial data. The Chicago region has two great examples of this regional/municipal data strategy that can surely help Illinoisans gather data that otherwise might not be available from the State: the City of Chicago’s new Open Data Portal, and Cook County’s government data site.
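To give a flavor of what cleaning and linking can look like, here is a deliberately crude sketch that matches business records from two sources on a normalized name. It is not our production method: the function names, the suffix list, and the sample records are all made up for illustration, and real linkage work usually needs fuzzier matching plus a human review of the results.

```python
import re

# Hypothetical list of legal suffixes to ignore when comparing names.
_SUFFIXES = {"inc", "incorporated", "ltd", "limited", "corp", "corporation", "co", "llc"}


def normalize_name(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and drop trailing legal suffixes."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    while words and words[-1] in _SUFFIXES:
        words.pop()
    return " ".join(words)


def link_records(left, right, key="name"):
    """Pair up rows from two lists of dicts whose normalized names are equal."""
    index = {}
    for row in right:
        index.setdefault(normalize_name(row[key]), []).append(row)
    return [
        (l_row, r_row)
        for l_row in left
        for r_row in index.get(normalize_name(l_row[key]), [])
    ]


if __name__ == "__main__":
    # Invented sample rows standing in for two different Open Data sources.
    municipal = [{"name": "Acme Widgets Ltd.", "licence": "food service"}]
    federal = [{"name": "ACME WIDGETS LIMITED", "director": "J. Doe"}]
    for m_row, f_row in link_records(municipal, federal):
        print(m_row["name"], "<->", f_row["name"], "|", f_row["director"])
```

Exact matching on a cleaned-up key will miss misspellings and will occasionally pair two unrelated businesses with the same name, so treat every link as a lead to verify rather than a fact.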
Open Data, Done Right
This is the rel8ed.to story for Advanced Data: Find it, refine it, link it, relate it, mine it. Do it legally. Be ethical in your use of it. And, be proud of the work you’ve done to bring new power to your predictive analytics using Open Data “The Right Way”.
This post was written by Bob Lytle