While traditional data mastering works just fine for small, simple problems, a new approach was needed for larger, more complex projects, according to software company Tamr.
As enterprise data grows exponentially, decades of technologies have failed to address the challenge of large data volume and variety and unintended data silos. And there is often enormous value in integrating these silos for customer cross-selling, optimizing supply chains and title mastering.
“Today, everyone is focused on” artificial intelligence (AI), “data science and data models but, if you focus too much on these end goals only, your data science projects will fail,” Mollie Rose, sales manager-enterprise, commercial vertical at Tamr, warned during the Data, Innovation & Collaboration breakout session “Data Mastering at Scale,” which was part of the April 13 EIDR Annual Participant Meeting.
“Clean data must be at the forefront of these projects because if you have bad data going in, you’ll have bad output coming out,” she said.
The industry and enterprises overall “have been obsessed with the big data problem,” she pointed out. “The good news is though that it isn’t much of a problem anymore; we’ve pretty much solved it.”
Companies have “invested billions and billions of dollars moving big data around using the best possible infrastructure but, even with all that investment, we still don’t have a complete single view of our titles or clients [while] our analytics are still stale and incomplete,” she noted.
And “worst of all,” regardless of where you keep your data, “people will say that it can’t be trusted because there’s multiple records,” she said. “So we spent all of this money moving data around and trying to solve this big data problem but something still isn’t right. And that is the bad data problem.”
How Tamr Solves the Bad Data Problem
Tamr is “part of a new generation of companies” that set out to solve the bad data problem, Rose said, noting that the company was started in 2013 and grew out of a project called Data Tamer at MIT’s Computer Science and Artificial Intelligence Laboratory, which is where the company’s name came from (minus the e).
“In working with various industry leaders, we found that the core principles across them are being agile, delivering pipelines that are continuous, being collaborative and best of breed, and executing that in a scaled-out manner,” she said, adding: “Those are really core to how we deliver our technology and you’ll see more of that as we go along.”
Tamr provides a machine learning solution and it is “cloud-native on all three of the major providers” of cloud services (AWS, Google Cloud and Microsoft Azure), she said.
Tamr uses a “probabilistic approach to data mastering, so we don’t use rules or rules-based methodologies and instead use fuzzy matching and machine learning to master the data so that we’re able to train the models,” she explained.
Traditional rules-based mastering for silo integration does not scale, according to Tamr. Rule systems are known to work as long as the rule base is small (say, 500 rules), but when a mastering project requires substantially more than that, traditional projects tend to fail.
“At the core of what” Tamr is delivering is enabling organizations to “take disparate records” and be able to figure out whether The Simpsons: Werking Mom is the same thing as Werking Mom: Simpsons, Rose noted.
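To make the title example concrete, here is a minimal Python sketch of an order-insensitive fuzzy comparison. It is an illustration only, not Tamr's method, which the company describes as relying on trained machine learning models. The function names, the stopword list and the 0.8 threshold are all assumptions made for this example.

```python
import re

# Hypothetical list of words to ignore when comparing titles.
STOPWORDS = {"the", "a", "an"}

def title_tokens(title):
    """Lowercase a title, split on non-alphanumerics, and drop articles."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return frozenset(w for w in words if w not in STOPWORDS)

def token_similarity(a, b):
    """Jaccard similarity of the two titles' token sets, from 0.0 to 1.0."""
    ta, tb = title_tokens(a), title_tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def likely_same_title(a, b, threshold=0.8):
    """Treat two titles as a probable match above an assumed threshold."""
    return token_similarity(a, b) >= threshold

print(token_similarity("The Simpsons: Werking Mom", "Werking Mom: Simpsons"))  # 1.0
```

Because the comparison ignores word order, punctuation and articles, the two spellings of the episode title score as identical, while a different episode of the same series would share only the series name and fall well below the threshold. A production system would combine many such signals across fields rather than rely on one score.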
Tamr helps organizations “master that data into a golden record that has the most complete and up-to-date information,” she said, adding: “The techniques we apply to get there is a mix of curation. So performing the schema and entity resolution feedback as well as using human guidance in order to master the data in a way that is going to be relevant to your business. So your feedback here is extremely important, ensuring that the data is being mastered in a form that people actually care about, understand and that they’re going to use.”
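The idea of a golden record can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Tamr's implementation: the field names, the sample values, the dates and the "newest non-empty value wins" rule are all hypothetical, and the human feedback step Rose describes is not modeled.

```python
from datetime import date

def golden_record(cluster):
    """Merge records judged to be the same entity into one record.

    For each field, keep the non-empty value from the most recently
    updated record that has one (an assumed survivorship rule).
    """
    # Newest first, so the first non-empty value seen for a field wins.
    ordered = sorted(cluster, key=lambda r: r["updated"], reverse=True)
    merged = {}
    for record in ordered:
        for field, value in record.items():
            if field == "updated":
                continue
            if value not in (None, "") and field not in merged:
                merged[field] = value
    return merged

# Made-up records from two hypothetical source systems.
records = [
    {"title": "The Simpsons: Werking Mom", "series": "", "language": "en",
     "updated": date(2021, 3, 1)},
    {"title": "Werking Mom: Simpsons", "series": "The Simpsons", "language": "",
     "updated": date(2020, 11, 15)},
]

print(golden_record(records))
```

Here the merged record takes its title and language from the newer source and fills the missing series name from the older one, which is the sense in which a golden record is the most complete and up-to-date view of the entity.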
Tamr also supports the ability to “bring in external data as needed in order to augment your internal data” to provide a “complete view,” she noted.
“Once you have that strong foundation of master data, you’re going to have analytic uses of it, so pushing it into solutions like Qlik, Tableau or Looker,” she said, adding: “You’ll also have many operational uses of it. So getting that information back into whatever solution you’re using to interact with it on a daily basis.”
Tamr built it in a way that she said “makes it simple for you to have good quality data to power your downstream analytic and operational use cases,” as well as to help organizations meet their AI goals.
Problems Across the M&E Supply Chain
“There are a lot of data problems across the very complex media and entertainment supply chain,” Rose went on to say, explaining: “It can be titles. It can be talent. The audience. It can be the back office looking at supply chain and procurement. It can be your distribution plan with your exhibitor. It can be what EIDR’s trying to solve with a unique entertainment ID. But you also have other companies with entertainment IDs that provide information and so you need to bring this together into a clean, connected and classified way.”
Tamr does that with “higher data accuracy by reducing manual workflows significantly and scaling it across the enterprise and across all these different data domains that [need] to come together to give you that comprehensive view,” she told viewers. “This helps you understand your data is an asset and that you can really use your data to start unlocking transformative insights [for] your business.”
She pointed to Creative Artists Agency (CAA) as an example of a company that used Tamr’s solution.
“It used to take them several weeks to match a query of what talent should be put in front of which audience,” she said. “Now, with Tamr, they’re able to do this at scale across all of the different talent databases as well to the different attributes associated with that talent. And the best part is they’re able to run queries now in a matter of seconds instead of weeks so they can match the right talent with the right content and opportunities.”
Yizhi Yin, sales engineer and data scientist at Tamr, went on to provide a demonstration of how the Tamr platform can be used to bring movie databases together and match movie titles across different data sets.