Towards a ‘Local News Engine’ #opendata

Weekly lists of local civic information (law courts, planning, licensing of alcohol, entertainment, adult, gambling etc) are a strong source of story leads for local journalists of all types. These lists can be over 500 pages long, requiring patience, detailed local knowledge and much time and coffee to read. The lists are largely computer-created but have hitherto resisted automated extraction of newsworthy data. If this process could be automated, even in part, newsroom productivity and public interest scrutiny could increase.

Almost everyone I speak to about these lists rolls their eyes in despair. In one case, at a conference, just mentioning the Magistrates’ Courts lists provoked a platform speaker into a shouting diatribe against the Court Service. A few people have run experiments, only to be overloaded by data. The view prevails that the lists are just too hard to work with in data terms and that the publishers are unresponsive. However, I think there is an opportunity to change this as publication quality and processing capability improve.

As the ‘open data’ agenda matures, the way the data is published is slowly improving. Some councils even have a data store with an API, removing the need to scrape data. Geospatial data is creeping in too as addresses are better recorded. On the processing side, for the last year or so I have been working, away from the media industry, with some outstanding data scientists who help councils publish data and work on data standards. They think it should be possible to extract structured data from the lists in their many forms and use semantic tools and techniques to extract meaning, if the queries are kept simple.

One basic thing a ‘Local News Engine’ (TM) could do would be to look in the lists for the names of local notable people (politicians, criminals, celebrities), and even companies and places, doing newsworthy things (appearing in court, applying for planning permission etc) and report them to journalists as potential story leads in a way that fits into their workflow. To do this a Local News Engine would have to put a unified front end onto common local datasets, annotating and extracting information from them, and allowing the creation of structured searches that combine searches against key entities (such as lists of names, organisations or places) and events (such as ‘applied for planning permission’) with temporal and geospatial parameters. LNE would then build up a knowledge graph from these datasets tailored to a local news organisation and focussed on the individuals and locations of interest to it. It would then generate scheduled reports triggered on different timetables and in different formats: weekly email, web page, text message, WhatsApp, print documents and so on.
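The simplest version of that entity-plus-event search can be sketched in a few lines of Python. This is only an illustration, not the prototype's code: the `Record` and `WatchlistEntry` types, the `find_leads` function and all field names are hypothetical, and it assumes each list has already been broken into individual entries. Matching here is exact on the normalised name, so a real system would still need to cope with initials, misspellings and two people sharing a name, and a journalist would have to verify every lead.

```python
from dataclasses import dataclass
from datetime import date
import re


@dataclass(frozen=True)
class Record:
    """One entry extracted from a civic list (hypothetical structure)."""
    source: str      # e.g. "magistrates court list"
    event: str       # e.g. "applied for planning permission"
    text: str        # raw text of the entry
    published: date  # date the list was published


@dataclass(frozen=True)
class WatchlistEntry:
    """A person, company or place the journalist has asked to track."""
    name: str
    role: str        # e.g. "Councillor"


def normalise(text):
    """Lower-case the text and collapse punctuation and spacing to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def find_leads(records, watchlist, events=None, since=None):
    """Return hedged lead sentences for watchlist names found in the records.

    events: optional set of event types to keep.
    since:  optional earliest publication date to keep.
    """
    leads = []
    for record in records:
        if events is not None and record.event not in events:
            continue
        if since is not None and record.published < since:
            continue
        # Pad with spaces so only whole-word matches count.
        haystack = f" {normalise(record.text)} "
        for entry in watchlist:
            if f" {normalise(entry.name)} " in haystack:
                leads.append(
                    f"{entry.role} {entry.name} appears to be in the "
                    f"{record.source} ({record.event}, "
                    f"{record.published.isoformat()})"
                )
    return leads
```

The `events` and `since` arguments stand in for the event and temporal parameters of a structured search; a geospatial filter would slot in the same way once addresses are geocoded.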

The user requirement for news lead output will drive front end, data source and reporting design.  In the design phase a journalist would simply sit down with the Local News Engine team to debrief on a list of names, places and things they were interested in and say how they wanted their alerts. There would be a strong emphasis on recording the public interest in a person’s activities.

A bit like a steam train, bringing Local News Engine up to pressure for the first time would be the most challenging part, as the coders work with the idiosyncrasies of each data set. From experience, in a typical local patch some data is fine, increasingly in some sort of data store with an API; some needs to be scraped through the now well understood hidden APIs of local authorities’ online services (Northgate etc); and a diminishing amount still needs to be digested from a PDF, court lists in particular. Subsequent loads of data each period should become easier and eventually automatic. Replication in other areas should be possible given the commonality (if not complete standardisation) of the data sets, but each area would require bespoke work on induction.

Story leads generated could look like: ‘Councillor Fred Bonzo Smith appears to be in the magistrates court’, ‘Footballer Stan Kicker has applied for permission to build a house’, ‘Six bars near Priory Road have applied for later opening hours in the last six months’, ‘Seven criminals are in court with addresses near Flat Road’. These leads would be presented in a report form and at a threshold chosen by the journalist: anything from a weekly report in advance of an editorial meeting to a daily email or an instant alert for something happening to a named individual. Leads would be internal to the newsroom environment.
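A lead such as the Priory Road one combines an event type, a place, a time window and a threshold. A rough sketch of that kind of query follows, again in Python and again purely illustrative: the `Application` type, the `cluster_lead` function and its parameters are hypothetical, and it assumes the licensing data already carries coordinates, which is often not yet the case. Distance is measured as a straight line using the haversine formula.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from math import asin, cos, radians, sin, sqrt


@dataclass(frozen=True)
class Application:
    """One licensing or planning application (hypothetical structure)."""
    premises: str
    kind: str        # e.g. "later opening hours"
    lat: float
    lon: float
    received: date


def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points (haversine formula)."""
    earth_radius_m = 6_371_000
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    return 2 * earth_radius_m * asin(sqrt(a))


def cluster_lead(applications, kind, place, lat, lon,
                 radius_m, today, window_days, threshold):
    """Return a lead if enough distinct premises near a place applied recently.

    Returns None when fewer than `threshold` premises qualify.
    """
    cutoff = today - timedelta(days=window_days)
    nearby = {
        a.premises
        for a in applications
        if a.kind == kind
        and cutoff <= a.received <= today
        and distance_m(lat, lon, a.lat, a.lon) <= radius_m
    }
    if len(nearby) < threshold:
        return None
    return (f"{len(nearby)} premises near {place} have applied for "
            f"{kind} in the last {window_days} days")
```

The `threshold` argument is where the journalist's chosen reporting level would plug in: set low for an instant alert on one address, higher for a weekly round-up of clusters.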

There would be plenty of challenges for a Local News Engine to work through beyond the technology. These include: issues for office holders suddenly facing algorithmic scrutiny, who would ask if it was ‘reasonable’; the tricky-to-handle nature of the magistrates court lists, where juvenile and reporting-restricted cases need to be set aside; and the traditional debate around what is reasonable in the public interest and covered by the journalistic exemption to the Data Protection Act. In my view, a solid procedure should be able to deal with these, particularly if LNE is used as an in-house tool or one limited to a number of local subscribers who are journalists.

The potential here is huge, especially at a time when newsroom resources are diminishing and the time available to read the lists is under severe pressure. I am working on a prototype to make a Local News Engine a reality, bringing together my interests and networks in open data, local journalism and accountability. If we get something working, the code and techniques will be as open as can be so others can exploit them (not the output data though). If you are interested in helping out, chatting, sharing experiences etc, drop me a line. If there’s enough interest I’ll organise a hangout or something.

Follow Will

William Perrin

Founder of Talk About Local, Trustee of the Indigo Trust, member of UK Government transparency panels, former Policy Advisor to UK Prime Minister, former Cabinet Office senior civil servant. Open data do-er, Kings Cross London blogger. Loves countryside. Two kids, not enough sleep.