Vevo Data Science
Nov. 30, 2016
I coauthored a piece for Vevo about the data platform we built; give it a read.
Here is the full article:
November 23, 2016. Miguel Alvarado, Jarrod Kahn, Christian Termont & Chip Lambert
Over the last 18 months Vevo has started a process to re-boot the way it develops and releases products for its owned and operated properties. This has been no trivial effort for a company that had not historically prioritized its own products. The aesthetics of our new experiences on iOS, Android and tvOS have started to show our potential from a user experience standpoint, but the infrastructure that underpins these applications is mission critical to delivering on our long-term ambitions.
But building an industry-leading video experience isn’t just about great content and a slick user interface. It also requires a robust data backend capable of analyzing trends, parsing audience preferences, and measuring user engagement in a wide variety of contexts and scenarios. In the past, Vevo’s product and engineering teams were handicapped by a lack of data. Third-party analytics gave a fleeting glimpse into performance of applications, but didn’t offer the information we needed to build a user-centric, data-driven platform. Relying on vanity metrics like view counts as evidence for an application’s success risks conflating the success of the application with the growing popularity of music videos. Up until recently, Vevo lacked the key infrastructure it needed to build a world-class, in-house data platform. Not anymore.
The diagram above shows a high level overview of the new system and its capabilities. Initial data from the various Vevo applications is sent to an event collection system, where it’s paired with timestamp and geolocation data. Once this is done, the enriched data is funneled into an Amazon Kinesis stream. Kinesis is part of Amazon’s streaming data platform and is used to collect and process large amounts of data in real time.
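The collection step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Vevo's actual code: the function names `enrich_event` and `send_to_stream`, the field names `received_at` and `geo`, and the `geo_lookup` callable are all hypothetical. The client is only assumed to expose a `put_record(StreamName=..., Data=..., PartitionKey=...)` method, as the boto3 Kinesis client does.

```python
import json
import time
from typing import Any, Callable, Dict


def enrich_event(
    event: Dict[str, Any],
    client_ip: str,
    geo_lookup: Callable[[str], Dict[str, Any]],
    now: Callable[[], float] = time.time,
) -> Dict[str, Any]:
    """Return a copy of the raw event with a timestamp and geolocation attached."""
    enriched = dict(event)  # copy so the caller's event is left untouched
    enriched["received_at"] = now()  # hypothetical field name
    enriched["geo"] = geo_lookup(client_ip)  # hypothetical field name
    return enriched


def send_to_stream(
    kinesis_client: Any,
    stream_name: str,
    enriched: Dict[str, Any],
    partition_key: str,
) -> Any:
    """Act as a producer: serialize the enriched event and put it on the stream."""
    return kinesis_client.put_record(
        StreamName=stream_name,
        Data=json.dumps(enriched).encode("utf-8"),
        PartitionKey=partition_key,
    )
```

With boto3 installed and AWS credentials configured, `boto3.client("kinesis")` could be passed as `kinesis_client`. Taking `geo_lookup` and `now` as parameters keeps the enrichment step easy to test without network access.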
In Kinesis parlance, a producer is any application or process that puts data into a Kinesis stream, while a consumer (also known as an Amazon Kinesis Streams application) processes these records. Two key consumers of this data are Spark on EMR, which handles ETL as well as real-time and batch analytics, and an Amazon Redshift cluster, which serves as Vevo’s in-house analytics warehouse. We load nearly one billion rows of data into the cluster on a daily basis and use a dimensional model to provide users with a rich analytic experience. In addition to storing production data in Redshift, Vevo’s data team wrote a data quality framework that uses Amazon Aurora to store SQL data checks and results. The ETL and quality checks are run on Apache Spark and orchestrated with Apache Airflow.
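To make the idea of storing SQL data checks and their results more concrete, here is a minimal sketch. It is not the framework Vevo built: Python's built-in `sqlite3` stands in for both the Redshift warehouse and the Aurora store, the table and column names (`checks`, `check_results`, `check_sql`, `bad_rows`) are hypothetical, and the convention that each check counts offending rows and passes when the count is zero is an assumption made for illustration.

```python
import sqlite3
from typing import List, Tuple


def create_check_tables(checks_db: sqlite3.Connection) -> None:
    """Create the (hypothetical) tables that hold check definitions and results."""
    checks_db.execute(
        "CREATE TABLE IF NOT EXISTS checks ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL, check_sql TEXT NOT NULL)"
    )
    checks_db.execute(
        "CREATE TABLE IF NOT EXISTS check_results ("
        "check_id INTEGER NOT NULL, bad_rows INTEGER NOT NULL, "
        "passed INTEGER NOT NULL)"
    )


def run_checks(
    warehouse: sqlite3.Connection, checks_db: sqlite3.Connection
) -> List[Tuple[str, bool]]:
    """Run every stored check against the warehouse and record the outcome."""
    # Load all check definitions first so we are not reading and writing
    # the checks database at the same time.
    checks = checks_db.execute(
        "SELECT id, name, check_sql FROM checks ORDER BY id"
    ).fetchall()
    outcomes = []
    for check_id, name, check_sql in checks:
        # Assumed convention: each check returns one row with a count of bad rows.
        bad_rows = warehouse.execute(check_sql).fetchone()[0]
        passed = bad_rows == 0
        checks_db.execute(
            "INSERT INTO check_results (check_id, bad_rows, passed) "
            "VALUES (?, ?, ?)",
            (check_id, bad_rows, int(passed)),
        )
        outcomes.append((name, passed))
    checks_db.commit()
    return outcomes
```

A check might be registered as a row such as `("no_null_user", "SELECT COUNT(*) FROM video_views WHERE user_id IS NULL")`, where `video_views` is again a made-up table name. Keeping the checks as data rather than code means new checks can be added without redeploying the job that runs them.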
One of the major goals of this initiative was to consolidate various third-party data streams into a single, centralized warehouse under Vevo’s control. Not only did this allow us to control our own reporting, it also eliminated service charges. “Additional access to data has given us invaluable insights,” Bryan Blair, Vevo’s VP of Ad Operations, noted. “The doors were blown wide open to create custom dashboards for anyone to instantly go in and see and assess what is going on in our ad delivery landscape, something we have never been able to do until now.”
Vevo’s primary reporting tools, Looker and Tableau, required little or no setup when connecting to the dimensional models implemented on Redshift. In short order, users are up and running, able to investigate the models, make discoveries from the data, and create useful reports.
Armed with a custom-tailored data warehouse, Vevo can analyze at a deeper level how users behave and interact with content, and make calculated, informed decisions. The underlying insights that drive high-level business Key Performance Indicators (KPIs) are much easier to see within our expanded dataset. Under the old system, Vevo’s in-house analysts were required to spend a significant amount of time teasing meaningful conclusions out of rudimentary data. Today, they’re able to devote that time to improving the product itself.
In the past, we’ve relied upon high-level KPIs, like page views, streams, or active user metrics. While these metrics are still relevant in certain contexts, they don’t tell us anything about how often users rely on specific app features, or anything about the devices and context in which people use our product. Without this granular information, roadmap prioritization and new feature development were extremely difficult. Music industry insiders and Vevo’s own Product, Design, and Engineering teams are all passionate about music, but the application experience needs to be built around the users’ interests and preferences, not our own.
People from different countries and cultures may interact with the same application in different ways, and our new data platform will allow us to tailor our features and responses to fit their needs. Our new Redshift cluster allows us to import historical data and analyze trends far more effectively than before, which will help ensure that our future products adapt more quickly to emerging talent, usage patterns, or music video preferences. We’re using this information to move to a new method of software and product development — one that allows us to improve the product immediately in response to users rather than using the waterfall model.
Vevo’s new user-centric, data-driven approach is already paying dividends. It’s led to a more efficient and less reactive philosophy that allows us to harness and direct resources where they’ll be most effective. Gathering more data and combining it in an in-house data warehouse clarifies relationships, reducing the need to lean on uncertain hypotheses to explain new trends or usage patterns. This, in turn, has allowed the Product team to deliver specific goals with clear definitions and requirements.
Because data, metrics, and insights are an organization-wide requirement, democratizing data has allowed Vevo to increase transparency and collaboration, unify traditionally siloed departments and geographically diverse offices, and provide context on how disparate functions contribute to common, high-level goals. We’ve also implemented our own Data Quality monitoring program, which allows us to address and solve any issues quickly and efficiently. This helps ensure our data is trustworthy and can be relied upon.
The Vevo data team has championed a concept called Data Wednesdays to help other teams adapt to the new world of democratized data. Every Wednesday, the data, product, and research teams meet and walk through the process of using our new warehouse to answer specific product questions with data. The product team is becoming so proficient that they’re now writing their own SQL queries against the warehouse, and even creating their own reports and visualizations.
While this post has focused mostly on our new data platform, there are tangible benefits for end-users as well. If a bug slips through the QA process, we can identify when and why it happens, and engineer a solution more quickly. Making it easier for our internal teams to access data lets us understand when a new feature has caught on, or whether an app update has succeeded in making a capability easier to discover, use, and enjoy. Conversely, if a new design has the unintended consequence of making a feature more difficult to use, we’ll see that happening, too.
We’ll be discussing how this data platform has been integrated into Vevo’s native apps and the benefits it provides to end-users in a future post. While Video Views and Streams remain important metrics, we’ve replaced our reliance on these vanity metrics with deep insights that allow us to improve features and iterate more quickly. And Amazon Redshift will continue to serve as the catalyst for unlocking the underlying relationship between music and video to deliver an amazing user experience.