The importance of data governance and access control has risen due to regulatory pressures and the need to share data. Data democratization puts data in the hands of as many users as possible for advanced analytics, but the sheer volume of data and the growing number of regulations constantly increase the need for more granular data security and access controls. As a result, two approaches have emerged to build, administer, and enforce data access policies: plug-in/policy-synchronization integration and virtualized data access architecture.
Plug-In/Policy-Synchronization Architecture
A plug-in architecture featuring native integrations with the data sources provides a lightweight footprint that is easy to layer into complex storage and compute systems. The use of plug-ins purpose-built for access control was first introduced with Apache Ranger, which functions as the leading security and authorization component for the Hadoop ecosystem. Because the plug-ins are natively built for the source systems, they do not introduce added complexity, dependencies, or overhead. They authorize users quickly enough to support thousands of users simultaneously accessing and querying data in production environments at petabyte scale. Today the Apache Ranger plug-in architecture is at the core of almost all modern data management and analysis tools: EMR (Hadoop), Databricks (Apache Spark), Starburst (Trino), Confluent (Apache Kafka), Dataproc (Hadoop), HDInsight (Hadoop), Dremio, and many more. Due to the robustness and proven scalability of this architecture, Privacera has chosen it as its underlying architectural foundation and optimized it for access control in hybrid and multi-cloud environments.
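To make the plug-in model concrete, here is a minimal sketch (ours, not taken from Privacera's documentation) of how an administrator might define a resource-based policy through Apache Ranger's public REST API; the Ranger host, credentials, service name, and database/table names are hypothetical placeholders. Once the policy is stored in Ranger, the plug-in embedded in the source system (HiveServer2, Trino, Spark, and so on) pulls it down and evaluates requests locally, which is why authorization adds essentially no overhead at query time.

```python
# Minimal sketch: create a resource-based Ranger policy via the public REST API v2.
# Host, credentials, service name, and resource names are hypothetical placeholders.
import json
import requests

RANGER_URL = "http://ranger.example.com:6080"    # hypothetical Ranger Admin host
AUTH = ("admin", "changeit")                     # placeholder credentials

policy = {
    "service": "hive_prod",                      # hypothetical Hive service registered in Ranger
    "name": "hr_salaries_read_only",
    "resources": {
        "database": {"values": ["hr"]},
        "table": {"values": ["salaries"]},
        "column": {"values": ["*"]},
    },
    "policyItems": [
        {
            "groups": ["hr_analysts"],                            # who the rule applies to
            "accesses": [{"type": "select", "isAllowed": True}],  # read-only access
        }
    ],
}

resp = requests.post(
    f"{RANGER_URL}/service/public/v2/api/policy",
    auth=AUTH,
    headers={"Content-Type": "application/json"},
    data=json.dumps(policy),
)
resp.raise_for_status()
print("Created policy id:", resp.json().get("id"))
```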
Policy synchronization is another very effective mechanism, in which governance policies are pushed down directly to the data source or service and enforced natively. Like the plug-in approach, it adds no overhead during policy enforcement. The unique advantage of this design is that it is not limited to databases; it can support any service with a native enforcement mechanism.
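As an illustration of the push-down idea, the sketch below (our own, with hypothetical table and role names) shows how a single centrally defined policy could be translated into a SQL source's native GRANT/REVOKE statements; a real connector would execute these over the source's administrative connection, so enforcement remains entirely native to the source.

```python
# Sketch of policy synchronization: render one central policy as native SQL
# GRANT/REVOKE statements to be pushed down to the source. Names are hypothetical.

central_policy = {
    "resource": "sales.orders",   # database.table in the target source
    "grants": {
        "analysts": ["SELECT"],
        "etl_service": ["SELECT", "INSERT", "UPDATE"],
    },
    "revokes": {
        "contractors": ["ALL PRIVILEGES"],
    },
}

def to_native_sql(policy: dict) -> list:
    """Translate the central policy into statements the source enforces itself."""
    table = policy["resource"]
    stmts = []
    for role, privileges in policy["grants"].items():
        stmts.append(f"GRANT {', '.join(privileges)} ON {table} TO ROLE {role};")
    for role, privileges in policy["revokes"].items():
        stmts.append(f"REVOKE {', '.join(privileges)} ON {table} FROM ROLE {role};")
    return stmts

for stmt in to_native_sql(central_policy):
    print(stmt)   # in practice these would be executed against the source
```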
Virtualized Data Access Architecture
Virtualized data access products provide a single endpoint for accessing multiple data sources and, in some cases, for federating queries across them. They are great for data analytics and business intelligence use cases. Tools such as Trino, PrestoDB, Dremio, and Denodo thrive on this model and create tremendous value for enterprise analytics initiatives. These vendors have been adding data sources to their connector lists and spending significant portions of their development budgets on improving performance and supporting features like data federation.
Almost all pure-play data virtualization tools use the Apache Ranger plug-in architecture as their primary access control integration. Apache Ranger gives them the ability to make authorization decisions in milliseconds, so they can focus on their core challenges, such as supporting more connectors and improving performance.
Access Control Solutions Adopting Virtualized Data Access Architecture
Data virtualization is a great technology for providing simplified access to diverse data stores, but security tools that use virtualization, such as virtualized data access control, can be hard to justify: they are not built with performance as their primary design goal and can impede query performance at scale. Furthermore, they may not be able to support all data sources, or all of the different tools used to access the same dataset.
Also, consider the impact of virtualized data access control when you are managing data spread across multiple sources with petabytes of data, hundreds of data access requests, multiple cloud services and data formats, and constantly growing enterprise demand. To use a simple analogy, it is like a professional photographer using a mobile phone's built-in camera for a studio photo shoot. Smartphones are great for point-and-shoot photos, but what about capturing professional-grade high-speed subjects, shooting slow-motion at 120 fps, working in very low light, or using image stabilization, swappable lenses, and more? Similarly, enterprise governance requires enterprise-grade solutions that address all your requirements, not just a few databases and access paths or tools. The fact that virtualized data access control products were not purpose-built for data access governance manifests in a host of symptomatic problems that impede the scalability, performance, and practicality of a data privacy and security solution.
It is natural for customers to have questions amid the confusion in the rapidly evolving data access control market. As a key supplier in this market, it falls on us to respond to the following misconceptions that our customers have shared with us about the use of virtualized data access control in an enterprise environment:
Misconception #1: Virtualized data access control customers can simply add a couple of entries to enable a database in the virtualized layer – FALSE
For virtualized data access control products to work in the real world, you need to import the metadata from the data source into the virtualized layer, which is something vendors don't typically bring up during their demos. Once you get into selectively managing databases and tables, the import feature becomes very complicated. If you already have thousands of tables and users currently querying them, those users need to update their queries, reports, code, and dashboards to the namespace of the virtualized layer. This means that every BI report, query, and application using those tables will need to be rewritten and its code retested. In other words, virtualized access control products could work for new or greenfield projects where everything starts from the ground up, but they will be prohibitively costly and labor-intensive for existing environments.
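To see what that namespace change implies, here is a small illustrative sketch (ours, with hypothetical catalog and table names): every direct table reference in every report, dashboard, and application query must be re-pointed at the virtualized layer before its access controls apply, and each rewritten artifact then has to be retested.

```python
# Sketch of the rewrite burden: direct source references must be re-pointed at the
# virtualized layer's namespace. Catalog and table names are hypothetical.
import re

SOURCE_TO_VIRTUAL = {
    "sales.orders": "virt_catalog.sales.orders",
    "sales.customers": "virt_catalog.sales.customers",
    # ...one mapping for every table imported into the virtualized layer
}

def rewrite_query(sql: str) -> str:
    """Replace direct source table references with their virtualized names."""
    for source_name, virtual_name in SOURCE_TO_VIRTUAL.items():
        sql = re.sub(rf"\b{re.escape(source_name)}\b", virtual_name, sql)
    return sql

report_query = (
    "SELECT o.order_id, c.region "
    "FROM sales.orders o JOIN sales.customers c ON o.customer_id = c.customer_id"
)
print(rewrite_query(report_query))
# Every existing BI report, query, and application needs this treatment, then retesting.
```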
Misconception #2: Virtualized data access control supports write operations – FALSE
A proper data access control solution must have the ability to natively enforce read as well as write access control for both on-prem and cloud services. For example, a business analyst might need to add a column to a table to reflect a change in sales territories, but to perform this operation the analyst needs permission from the administrator to both read data from the table and write new data into it. Read and write operations are critical for data science projects, which makes virtualized solutions, generally designed for read operations only, unsuitable for data science work.
To overcome this design shortcoming, some virtualized data access control vendors have come up with a "workaround": because they lack true file/object-level access control, they grant blanket permissions to the workspace. In other words, everyone in their version of the project has write access, which defeats the purpose of having an access control solution enforce write operations in the first place. Imagine a scenario where your data scientists and contractors both have write privileges to the data, as does everyone else in the workspace, regardless of role, location, and varying knowledge of the data. This approach is extremely risky in terms of safeguarding and preserving the integrity of the data. In fact, a control mechanism such as this severely jeopardizes data security as your data and organization grow.
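For contrast, here is a sketch of what object-level write enforcement looks like (the users, paths, and rules are hypothetical examples of our own, not any vendor's policy format): each request is checked as a (principal, object, action) triple, so a contractor can keep read access while write access stays limited to the roles that need it, instead of a blanket workspace grant.

```python
# Sketch: per-object read/write authorization evaluated for each (principal, object, action),
# rather than a blanket write grant on an entire workspace. All names are hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    principal: str            # user or group the rule applies to
    resource_prefix: str      # object/file path prefix it covers
    actions: frozenset        # e.g. {"read"} or {"read", "write"}

POLICIES = [
    Rule("data_scientists", "s3://lake/curated/sales/", frozenset({"read", "write"})),
    Rule("contractors", "s3://lake/curated/sales/", frozenset({"read"})),
]

def is_allowed(groups: set, resource: str, action: str) -> bool:
    """Allow only if some rule matches the caller's group, the object prefix, and the action."""
    return any(
        rule.principal in groups
        and resource.startswith(rule.resource_prefix)
        and action in rule.actions
        for rule in POLICIES
    )

obj = "s3://lake/curated/sales/2024/part-0.parquet"
print(is_allowed({"data_scientists"}, obj, "write"))  # True:  writes limited to the right role
print(is_allowed({"contractors"}, obj, "write"))      # False: contractors keep read-only access
```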
Misconception #3: Virtualized data access control supports all data sources – FALSE
Many virtualized data access control products started out as BI tools, so virtualization was already applied to most of the commonly used databases. When these vendors moved into the data security space, those integrations were inherited automatically because they work well for most BI use cases. So these vendors may be capable of supporting more connectors, but at what cost?
Virtualization-based access control solutions serve as a common logical data access point to which all the data sources must be connected. This proxy-server architecture sits between the data sources and the analytical services and relies on additional compute power for its extensions. That heavy reliance on compute capacity for scalability and performance puts users in a lose-lose situation: accept performance degradation by sticking to the existing budget, or accept rising TCO by purchasing the additional compute capacity needed. Unfortunately, neither is sustainable.
Misconception #4: Virtualized data access control has features to accommodate the growing enterprise demand – FALSE
Enterprises are looking for a more comprehensive data access governance solution to accommodate rising and more stringent data privacy and security mandates. They are looking for a solution that supports the entire data governance value chain, encompassing sensitive data discovery, classification, masking and encryption, granular access control, and more, in an exponentially growing data environment.
The virtualization-based access control vendors in the market may be expanding or increasing their development efforts to bulk up their product capabilities, but users and customers must not overlook a fundamental design flaw in their approach and architecture. In particular, the virtualized middle layer is tasked with performing many jobs: extensive data processing, replication of schemas, authorization of all data requests to validate whether users have access to the data, and more. Take a complex function like sensitive data discovery: imagine the massive backend work and compute capacity needed to pull petabytes of data into a single virtualized instance to identify and tag sensitive data. No wonder this often results in severe performance degradation and scalability issues. Virtualized access control sets out to be the "one throat to choke" for many data operations, but its inability to scale and perform turns it into a single point of failure.
Misconception #5: Virtualized data access control supports tag-based policies – MAYBE
Many virtualized data access control vendors employ security tags as a way to overcome their limited support for multiple flavors of policies. However, for those familiar with tag-based policies, granting access permission based on security tags alone is not a best practice for access authorization. Imagine you have tagged an element such as SSN as sensitive and have a policy that permits everyone from HR to see the data. This is a good demo use case, but it is wrong for several reasons, such as:
- Tags are set at the field or column level to restrict access. If that is the case, how would you then give permission to the table, or to the rest of the fields or columns in that table?
- Enterprises may have SSN details for both their employees and perhaps their customers too. Imagine a healthcare provider scenario: do you really want anyone from HR to see the SSNs of patients?
So how do tag-based policies work in Privacera? The approach is about giving users the flexibility to support their own use cases. For example, there are business tags: if a table or database is tagged as containing employee data, you can give HR permission to read it. For security tags, it is easier to set up negative (deny) policies, e.g., if the requesting user is not from the US, that user should not be able to see SSNs belonging to customers, or contractors cannot see any data that contains PII.
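The sketch below shows how such security-tag deny rules can compose with a business-tag allow rule (the tags, user attributes, and evaluation order are hypothetical illustrations of the concept, not Privacera's actual policy engine or syntax): deny conditions based on security tags are checked first, and only then does the business-tag allow apply.

```python
# Sketch: security-tag deny rules evaluated before a business-tag allow rule.
# Tags, attributes, and ordering are hypothetical illustrations of the concept.

def authorize(user: dict, column_tags: set, table_tags: set) -> bool:
    # Security-tag deny policies take precedence.
    if "SSN" in column_tags and user.get("country") != "US":
        return False                                   # non-US users never see SSNs
    if "PII" in column_tags and user.get("employment") == "contractor":
        return False                                   # contractors never see PII
    # Business-tag allow policy: HR may read employee data.
    if "employee_data" in table_tags and "HR" in user.get("groups", set()):
        return True
    return False                                       # default deny

hr_analyst = {"groups": {"HR"}, "country": "US", "employment": "employee"}
contractor = {"groups": {"HR"}, "country": "US", "employment": "contractor"}

print(authorize(hr_analyst, {"SSN"}, {"employee_data"}))        # True: US-based HR employee
print(authorize(contractor, {"SSN", "PII"}, {"employee_data"})) # False: deny rules win
```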
Misconception #6: Virtualized data access control offers a notebook to run the queries – TRUE
This is true for data virtualization tools. However, we question the practicality of such a feature for access control and governance, and the questions we often ask our customers are:
- Do you already use Power BI or other BI tools? These products will always have richer features and more functionality.
- Would you be running queries in the Databricks or Snowflake console? If so, would you get the same level of granularity in access control, such as tag-based, resource-based, and attribute-based access control, at the database, table, column, or file level?
Closing Remarks
We want our customers to use all their favorite tools to access their data, including data virtualization tools. We work closely with the open source community and enterprise vendors such as Databricks, AWS/EMR, Starburst, Dremio, and Domo to integrate our plug-ins natively within their products. This is a win-win for our partners, our customers, and for us. To learn more about how Privacera and Starburst partner to enable a data mesh concept that delivers rapid, federated cloud analytics with consistent governance and security, read our blog or whitepaper.