The future of Pharma: harnessing AI to decentralise data
Posted 9th September 2019 by Joshua Sewell
As Chief Data Officer for the OSTHUS Group, Eric Little co-founded LeapAnalysis, a new approach to AI, data integration and analytics.
LeapAnalysis is the first fully federated and virtualised search and analytics engine that runs on semantic metadata. It allows users to combine semantic models (ontologies) with machine learning algorithms, giving customers unparalleled flexibility in utilising their data.
Current technologies and why LeapAnalysis is different
Nearly all technologies surrounding AI and analytics are purely statistical in nature, relying on well-established algorithmic approaches such as decision trees and neural networks. What is often missing is the logical framework that contextualises these computations; that framework is captured using semantic technologies.
Semantic technologies use logical models to describe concepts, entities and relationships and put them into a structure that models the way that people naively think about the world – i.e., common sense concepts such as “nothing can be both bigger and smaller than itself at the same time”.
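To make the idea concrete, here is a minimal sketch, not drawn from LeapAnalysis itself, of how a semantic model can encode "common sense" structure as concepts and relationships that a program can reason over. The class and the concept names (Street, Thoroughfare, etc.) are hypothetical examples.

```python
# Illustrative only: a toy ontology with transitive subclass reasoning.
from collections import defaultdict

class TinyOntology:
    def __init__(self):
        self.subclass_of = defaultdict(set)   # child -> {parents}

    def add_subclass(self, child, parent):
        self.subclass_of[child].add(parent)

    def ancestors(self, concept):
        """All concepts this one is transitively a subclass of."""
        seen, stack = set(), list(self.subclass_of[concept])
        while stack:
            parent = stack.pop()
            if parent not in seen:
                seen.add(parent)
                stack.extend(self.subclass_of[parent])
        return seen

    def is_a(self, concept, candidate):
        return candidate == concept or candidate in self.ancestors(concept)

onto = TinyOntology()
onto.add_subclass("Street", "Thoroughfare")
onto.add_subclass("Thoroughfare", "TransportInfrastructure")

print(onto.is_a("Street", "TransportInfrastructure"))  # True
```

The point of even this toy model is that the relationship "a street is a thoroughfare" is stated once, logically, and every downstream computation can rely on it.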
LeapAnalysis marries these concepts in a new and interesting way.
There’s a conceptual model of the world that’s semantic, and a statistical, math-based ability to calculate over the world algorithmically. This is in line with how the human brain works: we run logical and math-based systems concurrently. We have sets of beliefs about how the world is structured (sometimes called “the background” in philosophy of mind) and subsequently, we do all kinds of computations over and against these logical structures.
Think of crossing a street: one understands that a street is a thoroughfare for traffic. Vehicles travel at certain speeds and are made of certain materials which, if they hit you, will cause injury. Crosswalks allowing pedestrians to cross come with certain lighting systems involving green and red, or various pictures showing who has the right of way. All of this is a background model we have learned and built over time (and it is representative of the many objects and events in the world that we use and navigate).
Once we begin to cross the street we overlay many computations into this model: “How fast is that car approaching?”; “How far do I have to walk to cross the street?”; “How long has the light been green?”; “How fast do I need to walk?” etc. We seamlessly use a set of classification and calculation systems in tandem for all such tasks. Now imagine if your data systems could do the same.
LeapAnalysis uses metadata models of various types (ontologies, taxonomies, entity-relationship diagrams, etc.) to build the logical representation of data sources. In this sense, it looks at questions such as: “What kind of data are you?”; “How are you accessed?”; “How are you used?”; “What kinds of relationships do you possess?” etc. We then use machine learning to run advanced algorithms on the data that can help compute new features such as time-series data, trends, clusters, indirect relationships, partial matching, feature extraction, etc.
The metadata models (logic) and the algorithms (math) work together in a way where the computational parts are informed by the logic-based models to understand what a specific classifier or data feature inside of an algorithm means and whether it is being used consistently across computations. This adds context to computations. At the same time, data that is produced computationally (e.g. a trend over time) can be saved and used logically as a ‘named graph’ or ‘business model’, giving it semantic import and enabling its contextual reuse in other computational applications. In this way LeapAnalysis can understand the world conceptually, modelling the way people naturally think and function, with high scalability and performance.
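As a rough illustration of the round trip described above, the sketch below computes a simple trend and writes the result back into a metadata layer as a small set of triples under a named-graph identifier. The graph names, predicates and figures are invented for illustration and are not LeapAnalysis's actual model.

```python
# Hypothetical sketch: a computed result (a trend) is stored back as
# semantic metadata so later queries can reuse it with context.

def linear_trend(series):
    """Least-squares slope of evenly spaced observations."""
    n = len(series)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(series) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, series))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

metadata_graph = []  # list of (subject, predicate, object) triples

sales = [10.0, 12.0, 13.5, 15.0]
slope = linear_trend(sales)

graph_name = "trend:sales-q1"          # the 'named graph' identifier
metadata_graph += [
    (graph_name, "derivedFrom", "source:sales-db"),
    (graph_name, "computedBy", "algorithm:least-squares"),
    (graph_name, "hasSlope", round(slope, 3)),
]
```

Because the computed slope now lives in the metadata graph with its provenance, another application can find and reuse it logically instead of recomputing it.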
Data Federation & Virtualisation
We’ve built this engine to operate on people’s data directly, without needing to migrate it. Most tools and technologies that do advanced analytics or search, especially in pharma, force you to build a common repository for the data before any analytics can begin. This means investing in something like a data warehouse or data lake first, then adding your analytics engine on top.
But here’s the problem: building that big data warehouse or data lake takes months to years and millions of dollars to get in place, without any guarantee it is going to provide any quality business-relevant analytics.
Not only do customers need to migrate data, but this often means they also need to make copies of the data. Customers therefore wind up with the original data sitting in the original data source and multiple copies of that data in new amalgamated systems such as warehouses or lakes. This can be even further exacerbated if one needs more copies to include in applications to run specific analytics. You can end up with multiple copies that must be stored and kept consistent, increasing storage costs, workloads and corporate risk.
LeapAnalysis bypasses these hurdles. There’s no ETL or pipeline needed for the data. Data stays where it is, fully disparate and federated. We have built a very sophisticated set of translators and connectors that allow us to speak the language of these data types natively, so we can talk directly to nearly any type of data source. This includes relational databases, file stores (including Excel), NoSQL stores, Triple Stores, Property Graph Stores, REST Endpoints, SPARQL Endpoints, Image Stores, Document Stores, Video Stores, and more.
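The connector idea can be sketched in a few lines: every source type implements the same small interface and answers queries in its native language, so no data ever leaves its home. This is a minimal illustration, assuming an invented interface and example sources, not LeapAnalysis's actual connectors.

```python
# Illustrative connector sketch: one interface, many native source types.
import csv
import io
import sqlite3
from abc import ABC, abstractmethod

class Connector(ABC):
    @abstractmethod
    def fetch(self, field, value):
        """Return matching records; the source itself is untouched."""

class SqlConnector(Connector):
    def __init__(self, conn, table):
        self.conn, self.table = conn, table

    def fetch(self, field, value):
        # Speaks SQL natively (identifiers are trusted in this sketch).
        cur = self.conn.execute(
            f"SELECT * FROM {self.table} WHERE {field} = ?", (value,))
        cols = [d[0] for d in cur.description]
        return [dict(zip(cols, row)) for row in cur.fetchall()]

class CsvConnector(Connector):
    def __init__(self, text):
        self.rows = list(csv.DictReader(io.StringIO(text)))

    def fetch(self, field, value):
        return [r for r in self.rows if r.get(field) == value]

# Demo sources with hypothetical pharma-flavoured records
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE compounds (name TEXT, target TEXT)")
db.execute("INSERT INTO compounds VALUES ('aspirin', 'COX-1')")
sql_src = SqlConnector(db, "compounds")
csv_src = CsvConnector("name,target\nibuprofen,COX-2\n")
```

The engine then only needs to talk to `Connector.fetch`, regardless of whether the data lives in a database, a file, or an endpoint.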
LeapAnalysis makes it very simple and straightforward to connect both newer and legacy types of data in one seamless tool. This also means we can not only integrate data within a company but also bring in open-source data that resides in data sources, ontologies, etc. outside of a company’s firewall.
But the key with LeapAnalysis is the federation and the virtualisation.
Federated data means that we leave everything in place – data access, data governance, data stewardship, security policies, all are maintained on the source system.
Virtualisation allows us to build representations of data that become useful for handling things like file stores where there is little implicit structure at the data source. Virtualisation also allows us to make data more user-friendly by hiding duplicate records or anonymising certain column headers, table names, or sheets/cells of data in files. The raw data stays in its native format while users gain an easy way to access and manipulate it. The important part is that no data is physically transformed, and so no traceability is lost.
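A minimal sketch of that virtualisation idea, assuming invented field names: a read-only view that de-duplicates records and masks a sensitive column name on the fly, while the underlying rows are never modified.

```python
# Illustrative virtual view: rename columns and hide duplicates lazily,
# without touching the source records.

def virtual_view(rows, rename=None, dedupe_on=None):
    rename = rename or {}
    seen = set()
    for row in rows:
        if dedupe_on is not None:
            key = row[dedupe_on]
            if key in seen:
                continue                       # hide duplicate record
            seen.add(key)
        yield {rename.get(k, k): v for k, v in row.items()}

raw = [
    {"patient_name": "A. Smith", "dose_mg": 50},
    {"patient_name": "A. Smith", "dose_mg": 50},   # duplicate record
    {"patient_name": "B. Jones", "dose_mg": 75},
]

view = list(virtual_view(raw,
                         rename={"patient_name": "subject_id"},
                         dedupe_on="patient_name"))
```

The source list `raw` is left exactly as it was, so every row in the view can still be traced back to its original record.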
Users log into LeapAnalysis and it provides user-centric workspaces based on their role in the organisation. It knows how certain concepts/models are connected to certain users and can provide that user-specific lens or a cross-domain set of lenses to view data from different perspectives. For example, chemists and biologists may both be interested in a certain receptor site on a certain cell type – however, they may look at that receptor in very different, domain-specific ways. At some point, however, they may want to look at the data from another’s point of view and gain new insight.
LeapAnalysis provides this capability and helps to bring users together into various communities of interest while maintaining needed restrictions and access points since none of the data sources are exposed without permission from their owner. LeapAnalysis learns what things are, where they are and in which data source, and knows how to get them and use them to solve complex problems.
LeapAnalysis is highly scalable. We have optimisation engines that work over queries to constantly learn new and faster ways to serve up data. Because of this, on average LeapAnalysis adds only milliseconds onto queries.
All filtering is performed on each data source directly and can go across multiple data sources at once to pull this information together and give one complete answer at incredible speeds. The program does not pull in large sets of data and then run computations internally, since this takes massive amounts of internal memory and typically degrades performance.
Instead, LeapAnalysis operates on the source natively, and pulls in only those data fragments it needs, so that a user can ask a very complex question that may require numerous different types of data sources and receive one very complete answer, with traceability as to which record came from which source (full traceability in one go).
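The pattern described in the last two paragraphs, push the filter to each source, bring back only matching fragments, and tag every record with its origin, can be sketched as follows. The sources and fields are hypothetical.

```python
# Toy federated query: filter at each source, merge fragments,
# keep traceability of which record came from where.

sources = {
    "assay-db":   [{"compound": "aspirin", "ic50": 3.2},
                   {"compound": "ibuprofen", "ic50": 7.1}],
    "trial-docs": [{"compound": "aspirin", "phase": 3}],
}

def federated_query(predicate):
    results = []
    for name, rows in sources.items():
        # Filtering happens "at" each source; only fragments travel back.
        for row in rows:
            if predicate(row):
                results.append({**row, "_source": name})  # traceability
    return results

hits = federated_query(lambda r: r.get("compound") == "aspirin")
```

One question spans both sources, and the merged answer records that one hit came from `assay-db` and the other from `trial-docs` (full traceability in one go).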
Connecting data sources in this manner now takes minutes to hours rather than weeks to months. Using LeapAnalysis, a researcher can simply connect directly to any data source to which they have access. LeapAnalysis asks for the data source type, address where it is located, port to open, username/password, and that’s it. The system tests the connection and lets you know instantly if the source is connected. It then uses advanced machine learning to read the schema and produce available data elements for semi-automated alignment to the metadata model (typically a semantic knowledge graph because of their expressivity, but as stated, any model will work). LeapAnalysis learns data structures over time, so its recommendation engine improves with time and use.
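The connect-and-align step can be pictured with a small sketch: test the connection, read the schema, and suggest mappings from source columns to metadata-model concepts, here via simple fuzzy string matching rather than the machine learning the article describes. The concept and column names are invented for illustration.

```python
# Hedged sketch of connect, introspect, and semi-automated alignment.
import sqlite3
from difflib import get_close_matches

MODEL_CONCEPTS = ["compoundName", "targetProtein", "ic50Value"]

def align_schema(conn, table):
    conn.execute("SELECT 1")                      # connection test
    cols = [r[1] for r in conn.execute(f"PRAGMA table_info({table})")]
    suggestions = {}
    for col in cols:
        match = get_close_matches(col, MODEL_CONCEPTS, n=1, cutoff=0.4)
        suggestions[col] = match[0] if match else None  # a human confirms
    return suggestions

# Demo source
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE assays (compound_name TEXT, target_protein TEXT)")
suggestions = align_schema(conn, "assays")
```

Each suggested mapping is only a recommendation for a user to confirm, which is where the semi-automated alignment and learning over time would come in.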
LeapAnalysis allows users to build virtual on-the-fly data lakes. In doing so, it dramatically speeds up one’s ability to run advanced analytics and do the kinds of computations that up to now have not been possible without significant up-front investment in time, money and effort.
AI and the future of the industry
LeapAnalysis can be very disruptive in the AI space because we’ve designed a product that challenges the status quo on how search and analytics systems are built and deployed. We think we are very much out on the cutting edge of this revolution in AI, but also see others having similar ideas.
The traditional Hadoop frameworks that have dominated most industries for the last decade or so are already dead – even Google has abandoned this technology. There will be a rise in new kinds of computational paradigms such as GPU vs CPU computing, which may or may not be successful in the long term.
What we are going to see a lot more of in future AI systems is a rise in decentralisation and virtualisation of data. This will reduce the need to build large and expensive data lakes, data warehouses and data marts to answer complex questions. Up to now, the Achilles’ heel for advanced analytics has been heavy financial commitment mixed with time and resource allocation, only to build systems that constantly overpromise and under-deliver. Nearly all such endeavours that we encounter with customers fail to answer the questions that drive better business decisions. We feel at LeapAnalysis that these issues can be rapidly changed for the better, at significant cost and time savings for customers.
This type of decentralised and virtualised approach will grow in the coming years – many of the big box vendors are now also talking more about this in public and producing tools as well. We feel that we can provide people with a finished product that can leverage much of this forward-thinking and put the future in our customers’ hands today.
LeapAnalysis provides customers with the ability to build intelligent on-the-fly data lakes. Launched in March 2019, LeapAnalysis is gaining attention across several verticals including pharma, medicine, materials sciences, crop sciences, transportation and financial services.
The 3rd Global Pharma R&D Informatics & AI Congress will focus on successfully leveraging the power of AI and machine learning by utilising high-quality big data. Visit the event page and register today.