As software continues to eat the world, every company must become a tech company. The good news: Declining cloud compute and hosting costs and open-source machine learning frameworks like TensorFlow mean it has never been cheaper and easier to build intelligent software. The bad news: It has never been cheaper and easier to build intelligent software, so this is no longer a competitive differentiator; it’s table stakes.
So, as we enter the “Great Commoditization” era of software, how can a CEO de-commoditize and build a long-term competitive moat around the business? I believe data will be the gold that separates the winners from the also-rans in this next generation of machine-learning-driven software.
But not all data sets are created equal. Deep data sets focused on solving specific problems are better than large, broad ones. Dynamic, constantly refreshing data sets are vastly superior to static data sets, almost regardless of size. And, ultimately, the data sets must be proprietary; the harder they are to access or replicate, the wider the long-term moat. Proprietary data is the fuel that can turn empty, commoditized workflow software into rich, defensible recommendation engines.
To build these engines, the data sets must also have a “closed loop”: inputs paired with the outputs they drive. Generally speaking, the math of machine learning works by correlating actions with outcomes and constantly recalibrating the model to improve accuracy. If all you have is inputs, like which ads users click to get to your website, but no outputs, like what they bought, you won’t be able to train the AI to do anything.
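To make the loop concrete, here is a minimal sketch (Python with scikit-learn) of a model that correlates ad-click inputs with purchase outcomes and recalibrates as each new outcome arrives. All feature names and data are invented for illustration:

```python
# Toy closed-loop model: inputs (which ad a user clicked, time on site)
# paired with outputs (whether they bought). Data is hypothetical.
import numpy as np
from sklearn.linear_model import SGDClassifier  # scikit-learn >= 1.1

model = SGDClassifier(loss="log_loss")  # logistic regression with online updates

# Inputs alone (ad_id, seconds_on_site) give the model nothing to optimize toward...
X_seed = np.array([[1, 30.0], [2, 5.0], [3, 120.0], [1, 45.0]])
# ...it is the paired outcomes (bought = 1, didn't = 0) that close the loop.
y_seed = np.array([0, 0, 1, 1])
model.partial_fit(X_seed, y_seed, classes=[0, 1])

# Every new interaction-plus-outcome recalibrates the model in place.
model.partial_fit(np.array([[2, 60.0]]), np.array([1]))
print(model.predict_proba(np.array([[3, 90.0]])))  # updated purchase odds
```

Without the `y` values, there is nothing here to fit: the loop, not the volume of clicks, is what makes the data trainable.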
So how can you build a deep, dynamic, proprietary, closed-loop data set, especially if you’re starting from ground zero?
From Zero To Defensible
It starts with rethinking the relationship between your product and your data. Data collection can’t be a series of one-off exercises used to inform company strategy; it must be built into the core product itself. In other words, every time a user engages with your product, you must collect data from the interaction that systematically makes future usage of the product even better, across all users in the network.
Bradford Cross, CEO of Merlon Intelligence, describes how to build this data flywheel into your products: “[Ensure] you are capturing totally unique data over time from how your product is used, and that data capture is designed precisely to serve the needs of your models, which are designed to serve the needs of the product functionality, which is designed to meet the needs of the customer. This data value chain ensures that the customer’s motivation is aligned with your motivation to compound the value of your proprietary data set.”
Building out an effective data flywheel can be the key to achieving a state of competitive nirvana I call "compounding." Once your business enters this phase, every new customer you add makes the data set and thus the product better, which attracts more customers, which makes the data set better, etc. For this model to work most effectively, you have to rethink how data is used across your customer base; data from every user in your network must be used to improve the product for every other user in the network, regardless of which customer they may work for. This requires significant architecture (technical, legal and data security) to work, but is key to maximizing the value of the system.
Google and Amazon have built the most formidable businesses the world has yet known by leveraging this model. We believe there is an even more exciting opportunity to use this model to build companies that don’t harvest users’ data to sell to them more effectively but instead help them complete their tasks more effectively. We call these businesses Coaching Networks, and they use AI not to automate workers away but to augment them in real time while they are performing their jobs.
Textio is a good example of a Coaching Networks startup that has built a data flywheel, initially focused on the recruiting space. Textio Hire optimizes job posts to help recruiters hire the right people faster. As recruiters write job posts in Textio, the software highlights words and phrases and suggests tweaks that would improve the likelihood of attracting a targeted candidate profile. Once the job is posted, Textio tracks the candidates who apply and automatically updates its model, and thus the suggestions it makes to every recruiter in the network. The product improves with every user, which improves the outcomes for every customer, which leads to a compounding data set. After three years in market, Textio has amassed thousands of users, who have collectively built the world’s largest database of job posts and outcomes: 370 million strong as of this writing.
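Textio’s actual system is proprietary, so the following is only a sketch of the compounding mechanic: one shared model fed by every customer’s outcomes, serving improved suggestions back to all of them. The class, fields and scoring heuristic are invented for illustration:

```python
# Sketch of a network-wide flywheel: every customer's posting outcomes
# update one shared model, and every customer's suggestions improve.
# Names and the scoring heuristic are illustrative, not Textio's.
from collections import defaultdict

class SharedPhraseModel:
    def __init__(self):
        self.applications = defaultdict(int)  # phrase -> applications received
        self.posts = defaultdict(int)         # phrase -> posts using the phrase

    def record_outcome(self, phrases, applications):
        """Any customer's outcome updates the network-wide statistics."""
        for phrase in phrases:
            self.posts[phrase] += 1
            self.applications[phrase] += applications

    def score(self, phrase):
        """Average applications per post; drives suggestions for everyone."""
        n = self.posts[phrase]
        return self.applications[phrase] / n if n else 0.0

model = SharedPhraseModel()
model.record_outcome(["fast-paced", "mission-driven"], applications=4)       # customer A
model.record_outcome(["mission-driven", "competitive pay"], applications=9)  # customer B
print(model.score("mission-driven"))  # customer C benefits from both: 6.5
```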
But no company starts with this user-driven flywheel. They start with hacks.
Types Of Data Hacks
Data hacks can be placed along a spectrum from aggregation hacks to creation hacks. The former start with existing data sets pooled together in some interesting way. These can be relatively straightforward to hack together and, as a result, are the most common types of startup data hacks.
On the other end of the spectrum are creation hacks, which, as the name suggests, involve the generation of data that hasn’t existed before (at least in a structured manner). These tend to be harder and are thus a rarer starting point.
Another dimension along which these hacks can be understood is how proprietary the resulting data is: How hard is it for others to replicate a meaningful portion of the data, and in what time frame? While the ultimate goal is to achieve a defensible, compounding data set, using less proprietary hacks to get started on the journey can be effective if you move quickly.
The hack or hacks you start with will depend on the assets you start with. Established companies may have unstructured data sets they can work to structure. They may also have large staffs they can co-opt as data hackers.
Startups, with fewer assets by definition, often have to be more creative. Indeed, the best founders I work with are extremely creative when it comes to designing data hacks.
Let’s explore some of the most commonly used data hacks:
1. Scraping (a non-proprietary aggregation hack):
Perhaps the most common startup data hack, this consists of collecting publicly available but scattered data. This can take the form of scraping websites, online databases or even offline databases. CoreLogic is perhaps the best example of what can be built on the back of offline scraping; it collects public records data from government offices across the country and sells the packaged data to real estate players for large sums. (A minimal crawler sketch follows the pros and cons below.)
Pros:
-Can be easy/low cost to start (e.g., build a web crawler)
-Can allow for aggregation of large volumes relatively quickly
-Can allow for fast iteration
Cons:
-Can be easy to replicate
-Can present legal issues (check a site’s terms of service and robots.txt beforehand)
-Can be hard to acquire “output” data necessary to close the loop
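Here is a minimal crawler sketch for the web variant. The target URL and CSS selectors are hypothetical, and any real target’s scraping rules should be checked first:

```python
# Minimal scraping sketch: pull public listings from a hypothetical page
# and store them as structured rows. Check the target site's terms of
# service and robots.txt before running anything like this.
import csv
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/public-records"  # hypothetical target

resp = requests.get(URL, timeout=10)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")

rows = []
for item in soup.select(".record"):  # CSS classes are assumed
    title = item.select_one(".title")
    date = item.select_one(".date")
    if title and date:
        rows.append({"title": title.get_text(strip=True),
                     "date": date.get_text(strip=True)})

with open("records.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "date"])
    writer.writeheader()
    writer.writerows(rows)
```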
2. Partnering (a proprietary aggregation hack):
Another common strategy is to explore partnerships between established entities, such as industry incumbents or governments, that already have large, unstructured data sets and startups with the talent and focus necessary to structure and make use of them. In exchange for access to this data, startups often offer their partners revenue shares, partial IP ownership, in-kind services or even good old cash. Tractable, which provides AI that improves car accident repair processes, is a good example of a startup that has hacked its way to success by partnering with industry incumbents.
Pros:
-Can provide valuable closed-loop, input/output data pairings for startups
-Can provide a competitive barrier to entry for startups if partnership involves exclusivity
-Can provide incumbent an opportunity to leverage an unused asset and move towards building its own data flywheel
Cons:
-Data is often unstructured and/or requires significant cleansing
-Can result in a serious tax on the startup, both financially and legally, if not well-structured. The best partnership deals often involve the startup providing in-kind (free) services to the incumbent in exchange for data
3. Crowdsourcing (a non-proprietary creation hack):
Crowdsourcing is a popular, low-tech way to seed a data set. It can take a variety of forms, from leaders asking their teams to collect data (e.g., take photos, create surveys, label data) to outsourcing these tasks to workers on services like Mechanical Turk. (A minimal labeling sketch follows the pros and cons below.)
Pros:
-Can tailor the data sets created to specific needs
-Can be the easiest and cheapest hack
Cons:
-Depending on the tactic, can be hard to scale
-Can be easy to replicate, so it’s important to move on quickly to other hacks
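Here is a sketch of the lowest-tech variant: a tiny script a team lead could circulate to label items by hand. The file name, items and label set are invented for illustration:

```python
# Minimal in-house labeling loop: walk a teammate through unlabeled
# items and append their answers to a CSV. Items and labels are
# hypothetical placeholders.
import csv

UNLABELED = ["Senior ML engineer, remote", "Ops associate, NYC"]
LABELS = {"1": "technical", "2": "non-technical"}

with open("labels.csv", "a", newline="") as f:
    writer = csv.writer(f)
    for item in UNLABELED:
        choice = input(f"{item}\n  [1] technical  [2] non-technical > ").strip()
        if choice in LABELS:
            writer.writerow([item, LABELS[choice]])
```

The Mechanical Turk variant swaps the script for paid workers and scales further, at the cost of more quality control.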
4. Workflow First (a proprietary creation hack):
A popular “two-step” data hack is to start by building workflow services, driving usage via the workflow, and then looking for ways to make use of the data captured. Salesforce, perhaps the quintessential cloud workflow provider, is looking to move in this direction with its Einstein offerings. (A capture sketch follows the pros and cons below.)
Pros:
-Can monetize early with the traditional benefits of a SaaS model
-Can build workflow to capture closed-loop, input/output data pairings
Cons:
-Hard to build a “two-step” business from a product, customer and talent perspective. Most don’t make it to step two
-Can be challenging legally, as contracts need to clearly allow for data sharing across customers from the beginning. Need to have a bulletproof data privacy and security strategy and team in place
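To make step one concrete, here is a sketch of the capture side, assuming invented event and field names: instrument the workflow so each user action is stored as an input/output pair a future model can train on.

```python
# Sketch of "workflow first" capture: log every workflow action as a
# closed-loop input/output pair for later training. The schema and
# field names are illustrative.
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class WorkflowEvent:
    user_id: str
    inputs: dict    # what the user did (e.g., sent a follow-up email)
    outcome: dict   # what happened (e.g., reply received, deal stage moved)
    ts: float

def log_event(event: WorkflowEvent, path: str = "events.jsonl") -> None:
    """Append one closed-loop training example per workflow action."""
    with open(path, "a") as f:
        f.write(json.dumps(asdict(event)) + "\n")

log_event(WorkflowEvent(
    user_id="rep-42",
    inputs={"action": "sent_followup", "template": "pricing_v2"},
    outcome={"replied": True, "stage_change": "negotiation"},
    ts=time.time(),
))
```

The hard part, per the cons above, is step two: the contracts, privacy and security work that let these events be pooled across customers.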
Companies often layer on a variety of hacks on their journey to the flywheel. Textio started by scraping public job boards. To create a synthetic closed input/output loop, it assumed that the time a job remained posted on the board was inversely correlated with the quality of the job post (better posts got filled faster).
This allowed it to build a rough algorithm that was good enough to approach potential partners. Textio worked with a few large employers and provided them free job post optimization in exchange for historical job post and hiring data. This injection of large closed-loop data sets allowed it to improve the product to the point that it could be sold on its own to paying customers. As more customers used the product, it continued to improve, and the flywheel began to spin.
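The proxy-label trick at the start of that journey is easy to sketch: with no hiring outcomes yet, time-on-board stands in for post quality. The fields and threshold below are invented for illustration:

```python
# Sketch of a synthetic closed loop: without real hiring outcomes, use
# days-on-board as a proxy label (posts filled faster are assumed to be
# better posts). Data and the median cutoff are illustrative.
from statistics import median

scraped_posts = [
    {"text": "Seeking rockstar ninja guru", "days_posted": 48},
    {"text": "Software engineer; clear scope, salary listed", "days_posted": 9},
    {"text": "Data analyst; hybrid, benefits listed", "days_posted": 14},
]

cutoff = median(p["days_posted"] for p in scraped_posts)
for p in scraped_posts:
    p["label_good"] = p["days_posted"] < cutoff  # proxy for "filled fast"

training_set = [(p["text"], p["label_good"]) for p in scraped_posts]
print(training_set)  # rough labels, good enough to train a first model
```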
As we enter the Great Commoditization era, CEOs of everything from freshly minted startups to established incumbents will need to answer two critical new questions: Which data hack or hacks are you pursuing, and how will they lead you to a flywheel?