Government in the 21st Century: Cloud, Ads, & Machine Learning
As it stands today, the Federal Bureaucracy is more a collection of individual, lightly coordinated programs than a set of coherent agencies. Over time, as new sub-agency centers have been formed to tackle the specific issues of the day, the size and scope of the American bureaucracy has ballooned to roughly 430 unique departments, agencies, and sub-agencies.
Simply maintaining these existing architectures is one of the costliest endeavors the Government undertakes: roughly $100B per year in IT spending, spread across 125 separate agencies that work on data collection or dissemination, plus 70 projects (since 2010) using big data analytics to the tune of another $85–90B per year. Most of that fractured work is simply managing the architecture already in place or making slight (siloed) improvements.
To manage these processes, the US Government has created an army of software applications specifically tailored to each individual need, with little overlap even within agencies. This lack of centralization leads to myriad blind spots when it comes to successfully executing government initiatives.
As an example, over the last few years our team has worked with the US Department of Health and Human Services (HHS) and its sub-agencies (CMS, FDA, CDC, SAMHSA, and others) on literally hundreds of public health media campaigns. Many of these campaigns running simultaneously had something in common: when they chose the most 'at risk' markets to focus on, they all included the area around Huntington, West Virginia. In fact, I've worked on over two dozen campaigns from HHS that have used broad-based risk indicators to target the 300k+ people in the greater Huntington area. The impacts of those media efforts were then analyzed completely separately, on separate platforms, with a separate group somewhere at HHS performing a meta-analysis on health outcomes broadly.
This disconnected process arguably introduces flaws into any clinical methodology, because much of the field analysis is performed using randomized control trials, which are the "gold standard" of fact finding (in the words of the immortal Cass Sunstein) but also a precarious and time-intensive approach to extracting the truth out of a situation. If the CDC's opioid-related work uses a control group of individuals who have separately received a different anti-drug message from the FDA, the results can be (and often are) muddied without anyone being the wiser, potentially invalidating any findings. The inability of researchers to replicate studies is not an issue restricted to Government: John Ioannidis' famous 2005 paper argued that most "current published research findings are false."
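To make the contamination worry concrete, here is a toy simulation (hypothetical effect sizes and exposure rates, not real HHS data) of what happens when part of a "control" group has quietly seen a different agency's campaign: the measured lift shrinks, and nothing in the trial itself tells you why.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000                 # people per arm (hypothetical)
baseline = 0.10            # baseline rate of the desired outcome
effect_a = 0.03            # true lift from Campaign A (the one being tested)
effect_b = 0.02            # true lift from Campaign B (another agency's message)
contamination = 0.40       # share of the "control" arm that actually saw Campaign B

# Treatment arm: everyone sees Campaign A.
treated = rng.random(n) < (baseline + effect_a)

# Ideally the control arm sees nothing; here 40% of it saw Campaign B.
saw_b = rng.random(n) < contamination
control_rate = np.where(saw_b, baseline + effect_b, baseline)
control = rng.random(n) < control_rate

print(f"true lift from Campaign A:    {effect_a:.3f}")
print(f"measured lift (contaminated): {treated.mean() - control.mean():.3f}")
```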
Trying to untangle the relative impact(s) across agencies is often difficult as well, due to variations in the definitions of the "same" things. The USDA defines a "rural" area largely based on population density, i.e. fewer than 1,000 people per square mile, whereas the Census Bureau uses a more geographical definition and overlays it with a further requirement of a population under 2,500. If the Census Bureau has information material to the USDA's outlays, the first hurdle is defining what "rural" really means and conforming the datasets accordingly. The list of issues like this goes on and on: metadata quality issues stemming from differences in how margins of error are factored in, differences in what technically constitutes 'variance' from an observed norm, different definitions of what is or is not a leading survey question, differences in what file format(s) information is stored in, differences in what devices users can even access information on, lack of cultural or language diversity in studies, and so on.
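As a toy illustration of that conforming step, the snippet below derives both flags from a few made-up county records (the thresholds mirror the simplified definitions above, not the agencies' actual rules) and surfaces exactly where the two definitions disagree:

```python
import pandas as pd

# Hypothetical county-level records; column names are illustrative, not real schemas.
counties = pd.DataFrame({
    "county": ["A", "B", "C"],
    "population": [1800, 45000, 900],
    "pop_per_sq_mile": [350, 1200, 40],
})

# Simplified stand-ins for the two definitions discussed above.
counties["rural_usda_style"] = counties["pop_per_sq_mile"] < 1000
counties["rural_census_style"] = counties["population"] < 2500

# The rows where the flags disagree are exactly the conforming problem:
# which definition governs the joint analysis?
print(counties[counties["rural_usda_style"] != counties["rural_census_style"]])
```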
With all of that said, there is hope on the horizon. While metadata can be incoherent due to the issues above, the Obama administration established the Open Data Policy by executive order in 2013, promoting not only open datasets, as the name implies, but also the creation of systems specifically designed to promote "interoperability between information systems." The requirements go on to specifically outline machine-readable formats and a systems architecture essentially analogous to an open source platform for Government use.
This seemingly simple request represents a fairly seismic shift from the custom-software-driven approach that has been the norm since the Government came online in the late 1990s. In this new environment, many public officials are essentially at a loss as to how to proceed, and who can blame them? On top of a rapidly changing technological environment, there have been huge increases in the sheer volume of information available, driven by the jump in smartphone penetration from 35% in 2010 to 77% in 2016 and the explosion to 8.1 billion digitally connected sensors or "things" in 2017 (up 31% from 2016). Google released TensorFlow in 2015 because we had to think of a new way to solve some of our most pressing issues internally, due in no small part to this raw information explosion.
When faced with the same challenges the Government now faces, technology companies developed two (then) radical solutions: first, shifting the decision-making process from control-study-led to (mainly Bayesian) statistics-driven decision making, and second, managing that new ecosystem not with rigid on-premise devices but with a far more flexible Cloud-based ecosystem. Instead of having fleets of systems developers maintaining legacy systems, the complex technical architecture was handed off to Cloud platforms so groups can focus more of their time and resources on the actual things they are working on: e.g. making Google Search the best and most useful website on the internet (just try and convince me otherwise).
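A minimal sketch of what that statistics-driven loop can look like in practice: a Beta-Binomial update that turns running tallies from two message variants (the counts below are made up) into a probability that one outperforms the other, recomputable every time new data arrives rather than at the end of a fixed study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical running tallies for two message variants: conversions and impressions.
a_conv, a_n = 420, 10_000
b_conv, b_n = 505, 10_000

# Beta(1, 1) priors updated with the observed counts.
post_a = stats.beta(1 + a_conv, 1 + a_n - a_conv)
post_b = stats.beta(1 + b_conv, 1 + b_n - b_conv)

# Monte Carlo estimate of P(variant B converts better than variant A).
samples = 100_000
p_b_better = (post_b.rvs(samples, random_state=rng) >
              post_a.rvs(samples, random_state=rng)).mean()
print(f"P(B > A) is roughly {p_b_better:.3f}")
```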
As an analogy, handing off the architecture is like hiring a construction firm to source all the materials needed to build your house rather than trying to do it all yourself. If you want a house as fast as possible, would you wade through the logistical nightmare of sourcing the correct materials, or would you rather spend your time designing the house to fit your family's needs? The lack of sourcing experience would be a huge problem too: how would you know if someone is selling you bad lumber? Would building your house out of only 50% "good" lumber be remotely acceptable?
Google has the best-in-class platforms ("materials") to start tying these data streams (or batches!) together. Google Cloud Dataflow is a tool that integrates data from multiple sources, and in a wide variety of shapes, to prepare it for concentrated analysis; it is specifically designed to increase speed and accuracy while reducing the cost and complexity of classic data science work. That information can then be analyzed in Google BigQuery, a data warehouse (and my favorite tool) for data analysis, pre-packaged with the same TensorFlow-based machine learning models Google uses and specifically designed to scale easily to petabytes while keeping costs low. It is fully managed by Google with zero ops, supporting SQL, ODBC/JDBC connectors, Stackdriver, and a wide variety of other features.
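As a rough sketch of that pipeline shape (the project, bucket, dataset, and table names are placeholders, not a real deployment), an Apache Beam job, which is what Dataflow executes, can read raw campaign logs, reshape them, and land them in BigQuery for analysis:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_event(line):
    """Turn one raw JSON log line into a flat row for BigQuery."""
    event = json.loads(line)
    return {
        "campaign_id": event.get("campaign_id"),
        "region": event.get("region"),
        "impressions": int(event.get("impressions", 0)),
    }


options = PipelineOptions(runner="DirectRunner")  # swap for DataflowRunner in production

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadLogs" >> beam.io.ReadFromText("gs://example-bucket/campaign_logs/*.json")
        | "ParseEvents" >> beam.Map(parse_event)
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            "example-project:campaigns.events",
            schema="campaign_id:STRING,region:STRING,impressions:INTEGER",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```

Once the rows land, the same questions can be asked across every campaign in one place with plain SQL in BigQuery, instead of once per siloed system.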
In the Cloud-based world, the model flips from 'what study can I do to determine truth?' to 'what data sources do I need to be able to determine truth?' The modeling work changes from mathematical tools used to tease as much information as possible out of a single study (ANOVA, t-tests, chi-square, etc.) to large-scale statistical analysis methods (time series smoothing, logistic regressions, general linear algebra, etc.). As Sunstein also said, no choice is risk free, and Big Data-based analytics certainly has its own set of flaws, so the randomized control trial approach is still extremely valuable but serves a different purpose, namely answering questions of clashing metrics. E.g. both changing physical location and cold-turkey rehab have been shown to be effective, to varying degrees, at combating drug addiction. A study could be used to develop a working model of how much weight should be applied to each and when, and critically, that model can be constantly improved in close to real time.
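A hedged sketch of what that kind of weighting model could look like: a logistic regression over simulated records (the flags and effect sizes below are entirely made up) where each row notes which interventions someone received and whether the outcome improved. The fitted coefficients play the role of the weights described above, and re-fitting as new records arrive is what enables the near-real-time improvement.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 5_000

# Hypothetical flags: did this person relocate, did they complete abstinence-based rehab?
relocated = rng.integers(0, 2, n)
rehab = rng.integers(0, 2, n)

# Simulated outcome with made-up effect sizes, purely to exercise the model.
logit = -1.0 + 0.6 * relocated + 0.9 * rehab
outcome = rng.random(n) < 1 / (1 + np.exp(-logit))

X = np.column_stack([relocated, rehab])
model = LogisticRegression().fit(X, outcome)

# Log-odds "weights" on each intervention, re-estimable as new observations stream in.
print(dict(zip(["relocated", "rehab"], model.coef_[0].round(2))))
```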
Google built these technologies to manage the insane volume of information that flows through the seven 1B+ user properties we maintain (e.g. 300 hours of video are uploaded to YouTube every minute). In making these platforms, we have also incidentally created some of the most valuable datasets in the history of public policy: for example, 1 of every 20 queries on Google.com is personal health related, amounting to several thousand queries in the US every single second. Recently, we partnered with HHS to make results on Google.com more accurate for the affected population for HIV/AIDS queries. Any advertiser knows their Google Search campaigns are one of the only ways to get real-time user feedback, and as I know well, most public-facing initiatives have some advertising component whose data they already control and could use to jumpstart the shift to the new era.
This concert of Cloud technology designed to manage user information efficiently and effectively, advertising-based data streams, and best-in-class machine learning tools makes Google the ideal partner for any government agency looking to run more effective public policy programs. We are just starting some of our most exciting work in this space, and I hope to have concrete results to share soon.