ViSTA-TV is processing a vast amount of different data formats at various time scales: EPG data that comes mostly in a rich relational data format as well as annotations represented as triples, simple key value pairs that need to be processed under real time constraints, aggregated output data which needs to be queried flexibly for presentation purposes in dashboards and applications, and ontologically represented knowledge.
All these data collections in their respective processing contexts pose very different requirements on the technologies used for storage and retrieval. These are the choices we made for the ViSTA-TV platform:
- The basics: MySQL for relational data. Though ViSTA-TV focuses on processing high-velocity data, much of the utility data exist on a more or less static time scale: electronic programme guide, shows, movies, actors, ratings, etc. Here, performance is not so much of an issue as are a rich query language and ease of use. We hence decided for MySQL as a relational data storage.
- High speed: Cassandra NoSQL database for realtime user logs. When it comes to processing live user log data including events like users joining or leaving channels, speed is everything, and trading it in for a rich query language is not an option. We therefore chose the Cassandra NoSQL database to store usage logs in the form of key value pairs.
- Slice and dice: Mondrian OLAP server for multidimensional data. Once data is processed, it eventually needs to be accessed by front end applications like dashboards and other apps. Here, querying speed is the critical factor which means that we have to prepare the service layer with the actual application requirements in mind. An OLAP server is well suited to query and aggregate in a flexible yet efficient manner.
Whereas each of the above choices has a good motivation for its individual use case, it also means that clients to these need to speak in different languages, often introducing heavyweight dependencies. In order to abstract from the actual implementation details of the data warehouse, we therefore decided to add an abstraction layer which serves all of the requests through a simple REST interface to which all clients can connect.
This interface is implemented by a RapidAnalytics server which under the hood manages connections to relational database systems, Cassandra, and the OLAP server. To that end, we are using its built-in mechanism to create user-defined Web services using the point-and-click interface of RapidMiner which in turn comes with various connectors to all of the above.
In addition to being an interface for the processing engine and apps, RapidAnalytics is also used for scheduled execution of data integration workflows, for example for the integration with EPG data sources, data enrichment, feature extraction and model generation.