Category Archives: Big data

Predictive Analytics in the Public Sector

My colleague Rainer Kattel (Tallinn University of Technology, Tallinn) and I are in the process of conducting interviews on digital transformation in the Estonian government. By coincidence we came across an interesting practice: the use of Big Data to review customs and financial data streams with the goal of reducing corruption. I wrote this up as a short contribution for the German Behörden Spiegel – a newspaper for public managers.

Here is the text (adapted from the German version – scroll down for a full translation of the original article):

Big Data are Internet-generated data from online interactions of humans with websites, or from passive data collection by computer networks or physical sensors. The resulting data sets are usually defined as “big” because of their size, the speed at which they are generated, and the possibilities they open for predictive analytics and real-time insights into the behavioral preferences of citizens.

Traditionally, public sector organizations operate mostly with administratively designed and collected data that result from direct interactions with citizens, government records, and other transactional data sets, such as open data. This data usually goes through an extensive cleaning and analysis process before it is made available, with significant time delays (in the case of census data, even years of delay). Oftentimes, this ‘old’ data is used for predictive analytics to project the potential needs of citizens. Big Data, however, are automatically generated and unstructured, and matching them with administrative data for use by public managers requires significant effort.

Using the example of the Estonian customs and tax services, Big Data analytics can help fight corruption in near real time. Based on standardized cash flows, Estonian tax and customs analysts have created risk profiles for different types of organizations, and every company is matched to one of these profiles. The profiles are continuously compared with actual cash flows, and daily updates and adjustments are made in case of minor deviations. In addition to the risk profiles, so-called Key Performance Indicators are used in combination with additional data sets, such as banking transactions, invoices, business registers, land register entries, etc. Data from online auction sites are also used to find out whether sellers are paying their sales taxes.

In case of anomalies between expected tax income and a company’s risk profile, warnings based on a predefined algorithm are sent to the analytics team. After a first review, the team decides what information to forward to the specialists, who conduct their own ad hoc investigations. Combining the analytical assessment with the specialists’ experience and judgment yields a more detailed risk assessment. As a result, either the risk profile is adjusted, or auditors launch an on-site tax examination the same day.
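To make the logic concrete, here is a minimal sketch of such a rule-based comparison. The profile structure, thresholds, and action labels are my own illustrative assumptions, not the Estonian authorities’ actual implementation:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RiskProfile:
    company_type: str
    expected_monthly_tax: float  # derived from standardized cash flows
    tolerance: float             # allowed relative deviation

def check_company(profile: RiskProfile, observed_tax: float) -> Optional[str]:
    """Compare observed tax payments against the company's risk profile."""
    deviation = abs(observed_tax - profile.expected_monthly_tax) / profile.expected_monthly_tax
    if deviation <= profile.tolerance:
        return None                     # within profile: no action
    if deviation <= 2 * profile.tolerance:
        return "adjust_profile"         # minor deviation: update the profile
    return "alert_analytics_team"       # major anomaly: warn the analysts

# Hypothetical usage
retail = RiskProfile("retail", expected_monthly_tax=12_000.0, tolerance=0.10)
print(check_company(retail, observed_tax=7_500.0))  # -> alert_analytics_team
```

In the real system, the “adjust or alert” decision feeds the human review loop described above rather than triggering audits automatically.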

This type of real-time analysis and timely interpretation of large-scale data sets allows the Estonian tax and customs authorities to assess the country’s current tax situation and potential corruption cases as they emerge.

In the future, predictive analytics tools can be used to identify patterns about the health of individual companies and to understand the potential economic and social impact of impending bankruptcies. Big data analytics can help governments make more effective and efficient decisions, be better prepared, and act preventatively.

===================

Here is the full text of the article (translated from the German original) and a link to the article.

Fighting Corruption in Real Time

Big Data are Internet-generated data that result from people’s online interactions with websites and from physical sensors. The resulting data sets, which are generally defined by their size, the speed at which they are created, and the resulting possibilities for real-time analysis, give public administration insights into the needs and actual actions of citizens. They represent a combination of social media data such as shared videos and photos, likes/shares, online banking, online purchases, and mobile phone data.

Traditionally, public administration works with administratively designed and laboriously collected data sets that arise primarily from direct interactions with citizens. Administrative data can be attributed to a case and to individual persons or households. Examples include census data, or previously processed cases, which, in combination with civil servants’ professional understanding, are used for so-called predictive analytics to forecast future trends. Big Data sets, by contrast, are automatically generated, unstructured, and require considerable effort to make the data usable for public administration.

In combination, Big Data and administrative data can help make the core tasks of public administration more efficient and effective. This is illustrated by the example of the Estonian tax authorities, who use Big Data analytics to quickly identify tax evasion so that on-site investigations can be launched, ideally on the same day.

Based on standardized financial flows, the customs and tax officials first created so-called risk profiles for different types of companies, which are tested against real financial data and continuously – if necessary even daily – adjusted to actual business behavior. In addition to the risk profiles, so-called Key Performance Indicators are used in combination with further data sets such as bank transfers, invoices, business registers, and land register entries. Data from online car markets are also included to find out whether sellers are paying taxes on their income.

As soon as deviations from taxable financial flows arise that do not match a company’s profile, the predefined algorithms send warnings to the analytics team, which forwards the data together with its own assessment to the specialist department. In combination with the expert judgment of the specialist authorities and the results of the risk analysis, a clearer risk assessment emerges, which the tax and customs authorities use to initiate further steps. Either the company’s risk profile is adjusted to the new situation so that no further warnings are triggered, or auditors launch inspections on the same day.

This type of real-time analysis and interpretation of large data streams allows the Estonian tax and customs authorities to obtain information about the country’s current tax situation. In the future, the tools that have already been established can also be used to predict, from patterns recognizable in the financial flows, whether a company is going to run into trouble. Predictive analytics can then also help to recognize burdens on the state and the emergence of potential social problems early on and, where possible, to intervene preventively – or at least to be prepared.

 

Professor Dr. Ines Mergel is Professor of Public Administration at the University of Konstanz, where she researches and teaches on the digital transformation of the public sector. Contact: ines.mergel@uni-konstanz.de

LSE Impact of Social Sciences blog: What does Big Data mean to public affairs research? Understanding the methodological and analytical challenges

The following text was originally prepared for LSE’s Impact of Social Sciences Blog and reposted here.

===

The term ‘Big Data’ is often misunderstood or poorly defined, especially in the public sector. Ines Mergel, R. Karl Rethemeyer, and Kimberley R. Isett provide a definition that adequately encompasses the scale, collection processes, and sources of Big Data. However, while recognising its immense potential, it is also important to consider the limitations of using Big Data as a policymaking tool. Using this data for purposes not previously envisioned can be problematic, researchers may encounter ethical issues, and certain demographics are often not captured or represented.

In the public sector, the term ‘Big Data’ is often misused, misunderstood, and poorly defined. Public sector practitioners and researchers frequently use the term to refer to large data sets that were administratively collected by a government agency. Though these data sets are usually quite large and can be used for predictive analytics, administrative data does not include the oceans of information that is created by private citizens through their interactions with each other online (such as social media or business transaction data) or through sensors in buildings, cars, and streets. Moreover, when public sector researchers and practitioners do consider broader definitions of Big Data they often overlook key political, ethical, and methodological complexities that may bias the insights gleaned from ‘going Big’. In our recent paper we seek to provide a clearer definition that is current and conversant with how other fields define Big Data, before turning to fundamental issues that public sector practitioners and researchers must keep in mind when using Big Data.

Defining Big Data for the public sector

Public affairs research and practice has long profited from dialogue with allied disciplines like management and political science and has more recently incorporated insights from computational and information science. Drawing on all of these fields we define Big Data as:

“High volume data that frequently combines highly structured administrative data actively collected by public sector organizations with continuously and automatically collected structured and unstructured real-time data that are often passively created by public and private entities through their internet interactions.”

This definition encompasses the scale of newly emerging data sets (many observations with many variables) while also addressing data collection processes (continuous and automatic), the form of the data collected (structured and unstructured), and the sources of such data (public and private). The definition also suggests the ‘granularity’ of the data (more variables describing more discrete characteristics of persons, places, events, interactions, and so forth), and the lag between collection and readiness for analysis (ever shorter).

Methodological and analytical challenges

Defined thus, Big Data promises access to vast amounts of real-time information from public and private sources that should allow insights into behavioral preferences, policy options, and methods for public service improvement. In the private sector, marketing preferences can be aligned with customer insights gleaned from Big Data. In the public sector, however, government agencies are less responsive and agile in their real-time interactions by design – instead using time for deliberation to serve broader public goods. The responsiveness Big Data promises is a virtue in the private sector but could be a vice in the public.

Moreover, we raise several important concerns with respect to relying on Big Data as a decision and policymaking tool. While in the abstract Big Data is comprehensive and complete, in practice today’s version of Big Data has several features that should give public sector practitioners and scholars pause. First, most of what we think of as Big Data is really ‘digital exhaust’ – that is, data collected for purposes other than public sector operations or research. Data sets that might be publicly available from social networking sites such as Facebook or Twitter were designed for purely technical reasons. The degree to which this data lines up conceptually and operationally with public sector questions is purely coincidental. Use of digital exhaust for purposes not previously envisioned can go awry. A good example is Google’s attempt to predict the flu based on search terms.

Second, we believe there are ethical issues that may arise when researchers use data that was created as a byproduct of citizens’ interactions with each other or with a government social media account. Citizens are not able to understand or control how their data is used and have not given consent for storage and re-use of their data. We believe that research institutions need to examine their institutional review board processes to help researchers and their subjects understand important privacy issues that may arise. Too often it is possible to infer individual-level insights about private citizens from a combination of data points and thus predict their behaviors or choices.

Lastly, Big Data can only represent those that spend some part of their life online. Yet we know that certain segments of society opt in to life online (by using social media or network-connected devices), opt out (either knowingly or passively), or lack the resources to participate at all. The demography of the internet matters. For instance, researchers tend to use Twitter data because its API allows data collection for research purposes, but many forget that Twitter users are not representative of the overall population. Instead, as a recent Pew Social Media 2016 update shows, only 24% of all online adults use Twitter. Internet participation generally is biased in terms of age, educational attainment, and income – all of which correlate with gender, race, and ethnicity. We believe therefore that predictive insights are potentially biased toward certain parts of the population, making generalisations highly problematic at this time.
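One partial mitigation researchers sometimes apply to this demographic skew is post-stratification: reweighting the sampled users toward known population benchmarks. Here is a toy sketch; the Pew figure above is real, but the age-group shares below are invented purely for illustration:

```python
# Toy post-stratification: reweight a skewed Twitter sample toward
# census-style population shares. All shares below are invented.
population_share = {"18-29": 0.20, "30-49": 0.35, "50+": 0.45}
sample_share     = {"18-29": 0.45, "30-49": 0.40, "50+": 0.15}  # young skew

weights = {g: population_share[g] / sample_share[g] for g in population_share}
for group, w in weights.items():
    # Each sampled user in `group` counts w times in downstream estimates.
    print(f"{group}: weight = {w:.2f}")
# 18-29: 0.44 | 30-49: 0.88 | 50+: 3.00 -> older users are upweighted
```

Note that reweighting can correct only for observed characteristics; it cannot recover the views of people who are entirely absent from the platform, which is the deeper representativeness problem raised here.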

In summary, we see the immense potential of Big Data use in the public sector, but we also believe that it is context-specific and must be meaningfully combined with administratively collected data and purpose-built ‘small data’ to have value in improving public programmes. Increasingly, public managers must know how to collect, manage, and analyse Big Data, but they must also be fully conversant with the limitations and potential for misuse.

This blog post is based on the authors’ article, ‘Big Data in Public Affairs’, published in Public Administration Review (DOI: 10.1111/puar.12625).

Note: This article gives the views of the author, and not the position of the LSE Impact Blog, nor of the London School of Economics.

About the authors

Ines Mergel is full professor of public administration at the University of Konstanz’s Department of Politics and Public Administration. Mergel focuses her research and teaching activities on topics such as digital transformation and the adoption of new technologies in the public sector. Her ORCID iD is 0000-0003-0285-4758 and she may be contacted at ines.mergel@uni-konstanz.de.

Karl Rethemeyer is Interim Dean of the Rockefeller College of Public Affairs & Policy, University at Albany, State University of New York. Rethemeyer’s primary research interest is in social networks and their impact on political and policy processes. His ORCID iD is 0000-0002-5673-8026 and he may be contacted at kretheme@albany.edu.

Kimberley R. Isett is Associate Professor of Public Policy at the Georgia Institute of Technology. Her research is centred on the organisation and financing of government services, particularly in health. Her ORCID iD is 0000-0002-7584-0181 and she may be contacted at isett@gatech.edu.

New paper: #BigData in Public Affairs published in PAR

Karl Rethemeyer, Kim Isett, and I just published a new paper in Public Administration Review with the title “Big Data in Public Affairs“.

Our goal for this article is to define what big data means for our discipline and to raise interesting research questions that have not yet been explored. Here is the abstract of our article. Please email me if you can’t access the full paper:

This article offers an overview of the conceptual, substantive, and practical issues surrounding “big data” to provide one perspective on how the field of public affairs can successfully cope with the big data revolution. Big data in public affairs refers to a combination of administrative data collected through traditional means and large-scale data sets created by sensors, computer networks, or individuals as they use the Internet. In public affairs, new opportunities for real-time insights into behavioral patterns are emerging but are bound by safeguards limiting government reach through the restriction of the collection and analysis of these data. To address both the opportunities and challenges of this emerging phenomenon, the authors first review the evolving canon of big data articles across related fields. Second, they derive a working definition of big data in public affairs. Third, they review the methodological and analytic challenges of using big data in public affairs scholarship and practice. The article concludes with implications for public affairs.

Reference:

Mergel, I., Rethemeyer, R. K., & Isett, K. (forthcoming): Big Data in Public Affairs, in: Public Administration Review, DOI: 10.1111/puar.12625.

New IBM Report: A Manager’s Guide to Assessing the Impact of Government Social Media Interactions

IBM’s Center for the Business of Government has published a new report: “A Manager’s Guide to Assessing the Impact of Government Social Media Interactions“.

This new report addresses the key question of how government should measure the impact of its social media use.

Social media data – as part of the big data landscape – have an important signaling function for government organizations. Public managers can quickly assess what citizens think about draft policies, understand the impact policies will have on citizens, or actively pull citizens’ ideas into the government innovation process. However, big data collection and analysis still pose a barrier for many government organizations, and it is important to understand how to make sense of the massive amount of data that is produced on social media every day.

This report guides public managers step-by-step through the process of slicing and dicing big data into small data sets that provide important mission-relevant insights to public managers.

First, I offer a survey of the social media measurement landscape, showing which free tools are used and the type of insights they can quickly provide through constant monitoring and for reporting purposes. Then I review the White House’s digital services measurement framework, which is part of the overall Digital Government Strategy. Next, I discuss the design steps for a social media strategy, which will be the basis for all social media efforts and should include the mission and goals that can then be operationalized and measured. Finally, I provide insights into how social media metrics can be aligned with strategic social media goals, and how these numbers and other qualitative insights can be reported to make a business case for the impact of social media interactions in government.
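To illustrate what aligning metrics with strategic goals can look like in practice, here is a hypothetical goal-to-metric mapping; the goal names, counts, and formulas are illustrative placeholders, not figures from the report:

```python
# Illustrative goal-to-metric mapping for a social media report.
# The raw counts would normally come from a platform's analytics export.
raw = {"posts": 120, "replies_received": 640, "shares": 410,
       "followers": 15_000, "citizen_ideas_submitted": 35}

metrics = {
    # Goal: responsiveness -> citizen replies per published post
    "replies_per_post": raw["replies_received"] / raw["posts"],
    # Goal: reach -> shares relative to the follower base
    "amplification_rate": raw["shares"] / raw["followers"],
    # Goal: participation -> ideas pulled into the innovation process
    "ideas_per_100_posts": 100 * raw["citizen_ideas_submitted"] / raw["posts"],
}

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The point of such a mapping is that every reported number traces back to a stated strategic goal rather than being collected for its own sake.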

I interviewed social media managers in the federal government, observed their online discussions about social media metrics, and reviewed GSA’s best practices recommendations and practitioner videos to understand what the current measurement practices are. Based on these insights, I put together a comprehensive report that guides managers through the process of setting up a mission-driven social media strategy and policy as the basis for all future measurement activities, and provides insights on how they can build a business case with insights derived from both quantitative and qualitative social media data.

 


 

Big Data in Government

[Originally posted on NextGov.com: Follow Philly’s Lead and Dive into the Big Data Future]

“Big data is valuable data in government.” – Mark Headd, Chief Data Officer, City of Philadelphia

“Big data” has become one of the new buzzwords, and it is quickly making its way into conversations in government. However, it is difficult for government officials to identify what the big data discussion means for their own organizations, what the challenges are, how they can create additional capacity for a job that does not necessarily connect to the core mission of their agency, and how they have to tackle the issue to respond to requests from the public.

The big data discussion hits government from two different sides. First, big data is created by citizens in their daily online interactions using social media, either directly with government or talking among themselves about issues related to government. As the recently released first guidance on social media metrics for federal agencies shows, government is just now getting into the groove of measuring, interpreting, and acting on insights it can potentially gain from its interactions with citizens. The other trend started a few years ago with the newly initiated conversations around open government and the launch of the federal data sharing site data.gov, a public website that hosts hundreds of data sets produced by federal agencies.

Originally, the big data discussion started outside of government, but it has direct implications for government as more and more agencies, politicians, and citizens use social media to interact with each other. Social networking platforms such as Facebook or Twitter allow citizens to directly connect to government agencies and share their immediate sentiments via comments in their own news feeds. In doing so, they create hundreds of new data points that increase the data volume far beyond a single phone call with a citizen request. As a matter of fact, the conversations can go back and forth between government and citizens, but also among citizens. Social media conversations might not even directly involve government, but they are related to ongoing hot-button issues, upcoming policy changes, or the cut of a government program.

Keeping track of the potentially thousands of externally created data points published by citizens on a daily basis has become an unmanageable problem that is only slowly being addressed in the public sector. In response, some agencies have shut down the possibility of leaving comments on their Facebook pages to reduce the cost of responding and tracking; others actively pull in citizen input or have moved on to ideation platforms that focus the conversation on a specific problem statement and streamline the solicitation of targeted responses and input from the public (see, for example, Challenge.gov).

The second trend that government agencies are facing is the mandate of the Open Government Initiative to release government data sets in machine-readable formats for public consumption. The flagship initiative data.gov has paved the way for state and local governments to respond in a similar fashion. Most recently, New York State released its own data portal, a website that hosts about 6,500 data sets from state, city, and local government sources.

The challenge for public managers is manifold: they have to identify appropriate data sets, clean them, potentially merge them from different databases, and make sure that they do not contain any individual or personal information that cannot be released to the public by law. Independent of each agency’s individual response, given the multitude of citizen interactions and ongoing conversations in combination with the top-down mandates, additional resources, increased capacity, and new positions with specific skill sets are necessary to respond appropriately. Beyond the internal organizational challenges of managing information streams, big data is much more: government agencies also need to understand how they can open themselves up to third parties who reuse the data.
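As a small illustration of the screening step described above, here is a minimal sketch that flags columns containing obvious personal information before release. The patterns and column names are simplified assumptions; a real release process pairs such checks with legal and privacy review:

```python
import re

# Simplified pre-release screen: flag columns whose values look like
# personal information. The patterns cover only a few obvious cases.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def flag_pii_columns(rows):
    """Return the set of column names matching any PII pattern."""
    flagged = set()
    for row in rows:
        for column, value in row.items():
            if any(p.search(str(value)) for p in PII_PATTERNS.values()):
                flagged.add(column)
    return flagged

sample = [{"permit_id": "P-1001", "contact": "jane@example.com"}]
print(flag_pii_columns(sample))  # -> {'contact'}
```

Pattern matching of this kind catches the easy cases; deciding whether a combination of innocuous fields can re-identify an individual still requires human judgment.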

Mark Headd, the newly appointed first Chief Data Officer of the City of Philadelphia, recently spoke to my social media class at the Maxwell School and shared his first-hand insights into the world of big data in government. Mr. Headd was appointed through an executive order of Mayor Nutter, is organizationally embedded within the ICT unit, and reports directly to the CIO and the mayor. Mayor Nutter made it a political priority to understand and organically implement elements of the open government movement, an advantage that other cities might not have, where Chief Information or Chief Data Officers still need to fight political battles before they can implement change.


He describes himself as a data evangelist and an embedded technologist whose task is to discover government data, think about ways to make it available to the public, and find a match between the data and external stakeholders who can potentially use it to create public value. Internally, he is focused on cultural change more than on data analysis issues or technological problems: he aims to convince public managers to see the potential value the data can have for the public, to start discussions about the reasons to release data and the way government officials view themselves, and, most importantly, to inform them about changing expectations and citizen needs. Mr. Headd then facilitates connections between data sources and potential data users outside of government.

Philadelphia was one of the first Code for America cities in the U.S., and its local tech community of civic hackers is highly motivated to reuse public information and create valuable applications. As opposed to data.gov, where data sets are mostly available for so-called “elite access” by a small group of highly trained computer specialists and data analysts, the approach in Philadelphia focuses on data that is not highly specialized and already publicly available, such as transit data, day care centers, information about flu shot locations, etc. Most people will consume the existing data through web browsers, either on their desktops or mobile phones. Mark Headd describes Philadelphia’s approach to open data as a focus on the “last mile”: the city invites civic hackers to recombine the existing disconnected data sets in a mindful way and go beyond the mere display of data, as is done on data.gov. Rather than making the mere publication of data the main objective, the city wants to collaboratively build new mobile phone applications by recombining data.

Events such as “Code for Philly,” held in collaboration with Code for America, bring members of government together with the local technology community to use data and build new projects that have the potential to create a civic good. Again, Philadelphia comes with a unique advantage: a civic culture similar to that of cities such as Boston or Baltimore, geographic proximity to NYC, and a very active civic technology community of programmers who are passionate about the city and feel a sense of belonging and community, which cities such as San Francisco, with fewer home-grown technologists, lack. Mr. Headd’s goal is therefore to capitalize on people’s love of their city.

One example of Mr. Headd’s success is applications such as CityGoRound.org, a clearinghouse for applications built around transit data. Local transit applications are built to help citizens catch their train, and the applications and their code are also made available for reuse in other cities by simply plugging in local transit data. Transit authorities agreed on a standard that makes sharing of already existing applications easy, so work products don’t need to be reinvented or recreated around the country. As a result, the city and its technology stakeholders are collaboratively building an entire ecosystem around government data use. All cities can use the same infrastructure and format to use the data.
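The standard referred to here is presumably the General Transit Feed Specification (GTFS), under which transit agencies publish schedules as a zip archive of plain CSV files. A minimal sketch of consuming such a feed (the file path is a placeholder):

```python
import csv
import io
import zipfile

# GTFS feeds are zip archives of CSV files (stops.txt, routes.txt,
# trips.txt, ...). Swapping in another city's feed is all an app needs.
FEED_PATH = "septa_gtfs.zip"  # placeholder path to a downloaded feed

with zipfile.ZipFile(FEED_PATH) as feed:
    with feed.open("stops.txt") as f:
        stops = list(csv.DictReader(io.TextIOWrapper(f, encoding="utf-8-sig")))

print(f"{len(stops)} stops loaded")
print(stops[0]["stop_name"], stops[0]["stop_lat"], stops[0]["stop_lon"])
```

Because every agency publishes the same file layout, an app written against one city’s feed can be redeployed elsewhere by changing only the feed it reads, which is exactly the portability the text describes.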

One of the challenges Mr. Headd sees is convincing citizens to reuse the data and make use of the applications. One approach Philadelphia has chosen is to advertise the newly created third-party products on public buses (see, for example, “Where’s My SEPTA?”). However, the question of endorsing and publicly sponsoring products that were built outside of government is still an unresolved issue.

Another challenge is the cultural change necessary to change existing bureaucratic governance procedures. For Mark Headd, the solution is a conversation about the effectiveness and efficiency of the current use of government data. He shows the public managers he interacts with how they can reduce inefficiencies in responding to a steady stream of Freedom of Information Act (FOIA) requests to release data to individual citizens or journalists. Every request takes time and is oftentimes burdensome to the unit and labor-intensive to research and respond to. Mr. Headd works together with public managers to look at the top five data requests and collaboratively find ways to release the data, at the same time taking the administrative burden of responding to each request off the unit. Responders can simply point requestors to the publicly available data set and save the time, resources, and money needed to research, vet, and formulate responses.

As an example, the Department of Licenses and Inspections receives multiple requests to release data about the number and locations of vacant houses as well as code violations. By releasing the data on a public website, the city allowed developers to create mobile applications, which in turn significantly reduced the number of written requests and phone calls. Research activities for similar types of requests are minimized by simply pointing requestors to the new app. Government staff can turn their attention to the core mission instead of being derailed by FOIA requests, a direct benefit of releasing government data.
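A toy version of that “top five” exercise, tallying topics from a hypothetical request log to decide which data sets to release proactively:

```python
from collections import Counter

# Hypothetical FOIA request log: one topic label per received request.
request_log = [
    "vacant_properties", "code_violations", "vacant_properties",
    "budget", "vacant_properties", "code_violations", "crime_stats",
]

# The most frequent topics are the best candidates for proactive release.
for topic, count in Counter(request_log).most_common(5):
    print(f"{topic}: {count} requests")
```

Even this crude tally makes the efficiency argument concrete: the topics at the top of the list are the ones where a single public release replaces the most individual responses.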

Reuse of the data on the citizen side is similarly efficient: during hackathon events, civic hackers build services on top of government data sets and thereby help themselves instead of having to reach out to government, a new form of co-delivering public services built on big data.

Mr. Headd shared a few insights on how other Chief Data Officers can tackle these issues in their own cities. He says, “Nobody wants to be first, so point people to other success stories in other agencies.” He is constantly evangelizing about the value of big data, but he is also informing local and city government and making his colleagues aware of what is going on around the U.S. (and the world), which helps them understand the benefits of releasing data. He suggests showing public managers tangible benefits instead of talking about less tangible ones such as openness or accountability, which are very difficult to quantify, especially in budget-driven conversations.

Mark Headd sees limitless applications for the release of government data and the analysis and reuse of big data: budget, spending, crime, or transit data enable people to see how well city employees are doing their jobs and help make them aware of improvements or new focus areas. The big data discussion can help here to talk about high-performing government and all the things that work very well in local government. Most of the coverage government receives is unfortunately focused on things that are going wrong; big data can change the focus.

Lastly, social media and government data can come together to create more personalized connections to citizens by communicating success stories. Citizen engagement will remain the major challenge: similar to voting, Philadelphia has identified about 40 other processes in which citizen feedback is needed, engagement is low, and new experiments to increase feedback are needed. An application was recently launched to pull citizen opinions into the policy-making process: Textizen.com allows citizens to send in their feedback by cellphone, without needing an expensive smart phone, to actively participate in the policy-making process. By institutionalizing easy-to-use tools to which every citizen, independent of age group, income, or technological literacy, has access, tools like Textizen can become part of a government’s future planning process and can automatically generate input without hosting town hall meetings at which only limited numbers of people participate.

The example of Philadelphia’s success is certainly an outlier: The city is blessed with a unique combination of advantages that other local governments might not have:

  • a political mandate that supports and mandates reuse of public information,
  • a technologist who understands managerial as well as technological and cultural issues in government, and
  • a unique tech community that loves its city and pushes the envelope to innovate.

However, all cities around the U.S. are invited to simply reuse existing applications instead of reinventing the wheel on a daily basis. Get going with resources that are already freely available and dive into the future of big data in government.