New Tech Post - data

<h2>DataSift: New Frontiers in Search</h2>

<p><img src="http://socialmedia.net/sites/socialmedia.net/files/DStitle550.png" /><br /> <b><a href="http://uk.linkedin.com/in/sarahblow">Sarah Blow</a> is a computer science graduate of Manchester University in England. She is now working on bringing <a href="http://datasift.net">DataSift</a> through its <a href="http://en.wikipedia.org/wiki/Software_testing">alpha testing</a> phase in preparation for wider use. The filtering technology underpinning DataSift allows highly refined searches across a selection of social networking platforms.</b></p> <p>Five years ago, in her spare time, Sarah founded <a href="http://girlgeekdinners.com/about-us/founder/">Girl Geek Dinners</a> to alleviate the isolation that many women may feel while working in what is still a male-dominated tech industry.</p> <p><img src="http://socialmedia.net/sites/socialmedia.net/files/Sarah%20Blow%20ava.png" align="left" hspace="10" /><b>How did DataSift come to be?</b></p> <p>“DataSift is pretty much the back-end system that powers <a href="">TweetMeme</a>. We wanted to rebuild the engine because we knew we could do more with it.
Rather than changing TweetMeme, we created this new brand called DataSift.”</p> <p><b>Who is it for?</b></p> <p>“The types of people we are bringing in are big financial services companies, anyone looking to do marketing and marketing analysis, agencies that are looking after brands, and so on. Pretty much anything you can think of where you want to find deeper layers of information.</p> <p>“While it is in alpha we are looking mainly for developers, but with the mindset of bringing in anyone who is filtering and creating content. We want to find out what their needs are beyond the basics of what we’ve got there at the moment.</p> <p>“We have basically asked, ‘What do people want to do with the system?’ And now we are looking at ways of packaging it so it can work in the right ways for different groups of users.”</p> <p><b>What do people use it for?</b></p> <p>“We’ve seen people use it for geo-targeting and geo-mapping content in order to find out about particular brands and track them. Also, people use it to find out where their users are based. They can also find the most influential users within their particular market.”</p> <ul> <li><b>Use example: San Francisco 49ers</b> <p>“For TechCrunch Disrupt we demonstrated the capability of DataSift by using publicly available information from a San Francisco 49ers game that was happening that weekend.</p> <p>“There were three rules which we set up 24 hours in advance.”</p> <ul> <li><b>Data collection:</b> “We first created a base rule. That was pulling in information from everyone at <a href="http://www.49ers.com/stadium/index.html">Candlestick Park</a>, anyone who mentioned the name San Francisco 49ers, and anyone who mentioned any of the players from the 49ers or the opposing team. <p>“There was no geo data in that filtered stream.”</p></li> <li><b>Geo-location:</b> “From there we built a second rule on top.
Taking all the output from the first rule we said, ‘Right, now we want anything in San Francisco.’ So, if someone had set their Twitter location to San Francisco we could pick them up, including their tweets about that particular subject, but only that subject.”</li> <li><b>Geo-targeting:</b> “Then we decided, that’s good, but it’s not perfect. What we really want to know is who is in the Park that’s really seeing the cool stuff. Can we manage to get some twitpics, for example, from inside the Park from people who were actually there? <p>“The only way you can verify they are really in the Park is if they are actually geo-located in the Park. So we have an option on DataSift to take a single point and set a radius around it. We found the geo-target for Candlestick Park and set a one-kilometre radius around it, which pretty much covers that area and anyone just outside the stadium.</p> <p>“But it didn’t come back with much at all. It literally came back with one user, and they hadn’t taken any photographs. They had just tweeted that they were there.”</p></li> </ul> <p>That only one user was returned by the parameters that were set up is very interesting in itself. One could reasonably have expected far more results. It is always worth remembering that what one assumes about a situation and what really happens may be two entirely different things.</p> <p>This is why having better tools to really drill down into the data and to refine and define the results is so vitally important.</p> </li><li><b>Use example: Starbucks</b> <p>“We were based in San Francisco at the time. So we tried a different exercise using DataSift where we basically said, ‘We want to find anyone in Starbucks who has got a <a href="http://www.peerindex.net/">PeerIndex</a> score of over 40.’ Let’s see who the influential people in San Francisco were at that time and find out which Starbucks they were in that day. That was a fun one to do.”
</p></li> </ul> <p>“You could do something similar with breaking news. If you were a news organisation and knew a story was breaking in a particular location, and you wanted to filter down to find the actual, legitimate sources really in that location, using DataSift would certainly be one way of doing it.”</p> <p><b>What is the next step?</b></p> <p>“We are aiming to have a drag-and-drop interface, which we haven’t finished yet. Users who don’t necessarily have strong technical ability or an understanding of the technicalities shouldn’t need that level of detail to use the system.</p> <p>“The FSDL language that we have got there we only really expect to be used by developers. It is not really aimed at the general user. But while DataSift is in alpha we’ll teach general users how to use it, in case we take a bit longer doing the other side of it.”</p> <p>The work being done by Sarah and the DataSift team promises to be a cutting-edge development in information retrieval.
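<p>The filtering ideas in the examples above, a point-and-radius geo rule combined with a PeerIndex-style score threshold, can be sketched in a few lines of Python. This is a hypothetical, self-contained illustration of the technique, not DataSift’s FSDL or its actual implementation; the coordinates, tweet structure, and field names are invented for the example.</p>

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical centre point for Candlestick Park and the 1 km radius
# described in the interview.
CENTRE = (37.7136, -122.3861)
RADIUS_KM = 1.0

def matches(tweet, min_score=40):
    """Keep tweets geo-located inside the radius whose author's
    (hypothetical) influence score exceeds the threshold."""
    lat, lon = tweet["geo"]
    inside = haversine_km(CENTRE[0], CENTRE[1], lat, lon) <= RADIUS_KM
    return inside and tweet.get("score", 0) > min_score

tweets = [
    {"geo": (37.7140, -122.3850), "score": 55},  # inside radius, influential
    {"geo": (37.7749, -122.4194), "score": 90},  # downtown SF: outside radius
    {"geo": (37.7136, -122.3861), "score": 10},  # inside radius, low score
]
print([t for t in tweets if matches(t)])
```

<p>A real stream filter would apply <code>matches</code> to each incoming item as it arrives; building one rule “on top of” another, as in the 49ers demonstration, amounts to composing predicates like this one.</p>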
If you want to help with their alpha testing you can still sign up at <a href="http://datasift.net/">DataSift</a>.</p> <p><small>Technology, data, tweetmeme | Thu, 16 Dec 2010 15:03:33 +0000 | Tom Murphy</small></p>

<h2>ScraperWiki: Hacks and Hackers Day comes to Dublin</h2>

<p><a href="http://scraperwiki.com"><img src="http://socialmedia.net/sites/socialmedia.net/files/catsquare.png" /></a><br /><br /> <b><a href="http://blog.scraperwiki.com/2010/09/13/new-event-dublin-hacks-and-hackers-hack-day/">Hacks and Hackers Hack Day</a> is taking place in Ireland on the 16th of November during <a href="http://www.innovationdublin.ie/index.php/about">Dublin Innovation Week</a>. The organiser of the day-long event is <a href="http://scraperwiki.com">ScraperWiki</a>, whose aim is to provide the resources that allow anyone with any level of programming ability to develop, store, and maintain software tools for extracting and linking data.</b></p> <p>By making data more accessible, ScraperWiki allows interested parties such as journalists to take advantage of initiatives such as the UK Government’s policy of opening its data to the public.
Since the <a href="http://en.wikipedia.org/wiki/United_Kingdom_Parliamentary_expenses_scandal">UK Expenses Scandal</a>, in which certain British parliamentarians were found to have abused their statutory allowances, journalists have become increasingly aware of the wealth of potential stories that lie in databases around the world. However, this data has usually been stored in a random, unstructured and relatively inaccessible manner.</p> <p><a href="http://twitter.com/#!/ainemcguire"><img src="http://socialmedia.net/sites/socialmedia.net/files/Ainepicture160.png" align="left" hspace="10" /></a>According to <a href="http://twitter.com/#!/ainemcguire">Aine McGuire</a>, who is in charge of sales and marketing for ScraperWiki, change has only come recently: “In 2003, a gentleman called <a href="http://en.wikipedia.org/wiki/Julian_Todd">Julian Todd</a> contacted the UK Government to find out how various MPs had voted on the war. When he tried to get this information in order to do some analysis on it, he was advised by the <a href="http://www.cabinetoffice.gov.uk">Cabinet Office</a> that all this information was published in <a href="http://www.publications.parliament.uk/pa/cm/cmhansrd.htm">Hansard</a>, the official record of the UK parliament. But it was difficult [to access]. It was deep down inside a website and he couldn’t do anything with it.</p> <p>“So Julian went and scraped all that information from Hansard and...then fed it into a website in the UK called <a href="http://www.publicwhip.org.uk">The Public Whip</a>, which shows you the voting record of all of the MPs in the UK.</p> <p>“It was very controversial, as he risked imprisonment for doing this because of Crown copyright. But they didn’t imprison him, and it was Julian Todd who came up with the idea for ScraperWiki.”</p> <p>Active since March 2010, ScraperWiki aims, Aine says, to “build the largest community-supported public data store in the world.
</p> <p>“You’ve got <a href="http://en.wikipedia.org/wiki/Main_Page">Wikipedia</a>, which supports content that’s predominantly text, and <a href="http://www.openstreetmap.org">OpenStreetMap</a>, which is for maps. What we want to do is create a wiki for data. We’re taking data that is in a very unstructured style and putting it into our structured data store. Where appropriate we’re adding longitude and latitude tags. We’re geo-tagging it, which means that data can be mapped.”</p> <p>In line with its aim of being a worldwide data resource, ScraperWiki has had datasets submitted from countries such as the UK, Brazil, Germany, Estonia, Greece and France, to name a few. These datasets cover such subjects as the 11,000 deep-sea oil wells around the UK, public transport traffic incidents in London, oil rig accidents, and so on.</p> <p>“As well as being a datastore, it’s a wiki for code,” Aine explains. “At the moment, if you want to do some programming, you would go out onto the web somewhere, download some tools, and install them on a server. ScraperWiki allows you to program directly in the browser, so in effect we’re giving you lots of libraries to program with.</p> <p>“You can write a screen scraper that uses any of the libraries we’ve got in our browser technology. You can use Python, PHP, or Ruby. So you can go off and scrape without having to install anything on your PC or server.”</p> <p>An added benefit is that, because of the inherently collaborative nature of wikis, code can be updated, improved, and shared by other programmers.</p> <p>Aine describes what to expect from the Hacks and Hackers Hack Day: “At the beginning of the day we have a little presentation about what a Hacks and Hackers Hack Day is all about. Then we give a little presentation on ScraperWiki, although we don’t prescribe that they use it. Then we let the journalists and developers gravitate together to form teams over datasets of interest.
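<p>The scrape-and-store workflow Aine describes earlier (write a screen scraper in Python, save the results into a structured data store) can be sketched with nothing but the standard library. This is a hypothetical standalone example, not ScraperWiki’s actual API; the page content and table layout are invented for the illustration.</p>

```python
import sqlite3
from html.parser import HTMLParser

# Stands in for a page fetched with urllib.request.urlopen(url).read();
# this little table of oil wells is invented for illustration.
PAGE = """
<table>
  <tr><td>Well A</td><td>57.10</td><td>1.20</td></tr>
  <tr><td>Well B</td><td>58.35</td><td>0.95</td></tr>
</table>
"""

class RowScraper(HTMLParser):
    """Collect the text of each <td>, grouped into rows by <tr>."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_td = [], [], False
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td = True
    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr" and self._row:
            self.rows.append(tuple(self._row))
    def handle_data(self, data):
        if self._in_td:
            self._row.append(data.strip())

scraper = RowScraper()
scraper.feed(PAGE)

# The "datastore": a structured table the scraped rows are saved into,
# where they can be queried, joined, or geo-tagged later.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE wells (name TEXT, lat REAL, lon REAL)")
db.executemany("INSERT INTO wells VALUES (?, ?, ?)", scraper.rows)
print(db.execute("SELECT name, lat FROM wells ORDER BY name").fetchall())
```

<p>Because the parsed rows land in an SQL table rather than free text, they are immediately accessible to other programs as well as people, which is exactly the kind of accessibility ScraperWiki is after.</p>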
Then they go off and hack all day. At six o’clock we ask the project groups to come back and each present, for three minutes, their particular visualisation of the dataset they have worked on.”</p> <p>Prizes are then awarded and there is a reception for the participants to attend. At a previous event in <a href="http://blog.scraperwiki.com/2010/08/11/video-liverpool-hacks-and-hackers-hack-day">Liverpool in July</a>, eight projects were produced by journalists and programmers working together using open data.</p> <p>For data-driven journalism to flourish, even with the maximum reasonable amount of access granted by governments around the world, the problem still exists of data being stored in <a href="http://www.information-management.com/infodirect/20070209/1075772-1.html">data silos</a>. Information has to be accessible not only to people other than those who made the original entries, but to other machines as well. Structuring information for greater accessibility is not going to happen all by itself. It will take the sort of co-ordinated and collaborative effort that organisations such as ScraperWiki offer to really make our world a more open and transparent place to live and work in.<br /> </p><p><i><small>At the time of writing the Hacks and Hackers day taking place in Dublin is fully subscribed, but tickets are still available for the <a href="http://blog.scraperwiki.com/2010/09/24/new-event-hacks-and-hackers-hack-day-belfast-hhhbel/">Belfast event</a> on the 13th of November.</small></i></p> <p>It is a free event and ScraperWiki is a not-for-profit organisation.
Please contact Aine through their <a href="http://scraperwiki.com">website</a> if you would be interested in sponsoring part of the event.</p> <p><small>Technology, data, information, openness, scraperwiki, transparency | Wed, 06 Oct 2010 06:06:23 +0000 | Tom Murphy</small></p>

<h2>Infographics: Communicating The Essence Of A Tidal Wave Of Information</h2>

<p><img src="http://socialmedia.net/sites/socialmedia.net/files/big_wave_medium.jpg" /></p> <p>As databases around the world begin to share and compare their data with ever-greater meaning and relevance through the rapid roll-out of <a href="http://socialmedia.net/linked-data-introduction">Linked Data</a> implementations, it is going to become more and more challenging to corral that data and make it into something user-friendly and practical. After all, data isn’t worth anything unless it is usable in some form.</p> <p>We do know that there is a tidal wave of data coming right at us just over the horizon. According <a href="http://www.emc.com/about/news/press/2009/20090518-01.htm">to one source</a> we passed 3 zettabytes (a zettabyte is 10<sup>21</sup> bytes) in 2008.
So how do we begin to make sense of it?</p> <p>One answer to the need for greater intelligibility lies in the nascent field of <a href="http://en.wikipedia.org/wiki/Information_graphics">information graphics</a>, or infographics for short. Up until now it has been a geeky/arty sub-genre of the Internet, regarded as something quite separate from the hard-core, often macho world of 'real' coding. But researchers, artists, statisticians and folks from all sorts of other fields are realising that not everyone wants to plough through all those numbers and data tables, and why should they, when a simple picture can tell the whole story?</p> <p>But infographics can be far more than the mere prettifying of data. Assembling data in this manner to produce an infographic, a chart, or some other means of communicating an idea visually is really content production. The most important rule of content production is to tell a story. That is the secret of all the most interesting infographics.</p> <p>The all-time master (so far) has to be <a href="http://en.wikipedia.org/wiki/Hans_Rosling">Hans Rosling</a>: it's worth taking a break now and watching <a href="http://www.ted.com/talks/lang/eng/hans_rosling_shows_the_best_stats_you_ve_ever_seen.html">his TED presentation</a>, in which he sets the record straight on widespread notions concerning the 'developing world'. There is even more of his work over at <a href="http://www.gapminder.org/">gapminder.org</a>.</p> <p>In his historical graphs, not only can you view a data subject over time, you can also compare it to neighbouring data subjects. Plus, in the graphs at Gapminder, you can set your own parameters to achieve a very fine degree of tuning. It is impossible to play with this data without garnering some very interesting insights into how the world has developed over the last hundred and fifty years or so.
Until very recently, for Hans to have communicated this knowledge, which he presents in such an understandable and approachable way, would have required hours, days, weeks, even months of assembly. Then there would be the time spent writing the book or making the film so that he could share the findings and insights with others.</p> <p>Not only do we have the chance to make data sensible and easy to use, we can, through the application of Linked Data and various new applications, do it in relatively short periods of time.</p> <p>What arises is not just a new and important channel of communication but the exciting possibility of a new art form. There will be a great need for more practitioners in this field with the creativity and talent to make huge swathes of data intelligible and useful.</p> <p>In the same way that IT departments devolved into separate computer services and web services departments, and social media has since become a professional sector in its own right, I can see data representation becoming an entire skillset and profession as well.</p> <p>Further reading:</p> <ul> <li><a href="http://queue.acm.org/detail.cfm?id=1805128">This article</a> from the Association for Computing Machinery is helpful in laying out the land and the terms involved.</li> <li><a href="http://vis.stanford.edu/protovis/">Protovis</a> is an open-source project from Stanford University. It is free and probably a good place to start building your own infographics.</li> <li><a href="http://tables.googlelabs.com/public/tour/tour1.html">Google Fusion Tables</a> is another place you can try building something for free.
</li><li><a href="http://nathanyau.com">Nathan Yau</a> (<a href="http://twitter.com/yfd">@yfd on Twitter</a>) has been leading the way for some time with, for instance, the use of UN data to create a <a href="http://nathanyau.com/world-progress-report">world progress report</a>. There is a nice <a href="http://fnewsmagazine.com/wp/2009/01/an-interview-with-nathan-yau/">interview with him in the magazine</a> of the School of the Art Institute of Chicago in which he lays out some of his thinking.</li> </ul> <p><small>Social Media, data, Infographics | Sat, 22 May 2010 23:15:36 +0000 | Tom Murphy</small></p>