This blog discusses the importance of data discipline and governance for an organization where a large amount of data is processed daily.
Data is the new oil, and companies should treasure it. The new class of powerful Machine Learning (ML) and Artificial Intelligence (AI) algorithms is data-hungry and only as good as the data provided to it. Many useful (profitable) insights can be derived from data across the organization, benefiting different departments and initiatives.
However, there are several challenges in maintaining discipline around the capture, storage, access, and usage of data. This blog covers some principles and advantages of data governance in a large organization.
Data and ML/AI Algorithms in Data Discipline Implementation
Most of the models we deal with today are statistical and are predicated on data.
ML/AI algorithms are good at exploring and exploiting relationships when there are a large number of touchpoints/features. For example, an external customer of one department could be a vendor of another department; by looking at the information together, ML models can use this relationship to the advantage of the company. Similarly, a single vendor can be supplying two different departments.
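As a minimal sketch of this idea (all entity names and figures below are hypothetical), combining two departments' views of the same entity into a single feature vector might look like this:

```python
# Hypothetical records from two departments; names and amounts are illustrative.
customers = {"Acme Corp": {"annual_purchases": 120_000},
             "Beta LLC": {"annual_purchases": 45_000}}
vendors = {"Acme Corp": {"annual_supplies": 80_000},
           "Gamma Inc": {"annual_supplies": 30_000}}

def dual_role_entities(customers, vendors):
    """Entities appearing in both datasets -- a cross-department
    relationship an ML model could exploit."""
    return sorted(set(customers) & set(vendors))

def build_features(name):
    """Combine touchpoints from both departments into one feature vector."""
    return {
        "annual_purchases": customers.get(name, {}).get("annual_purchases", 0),
        "annual_supplies": vendors.get(name, {}).get("annual_supplies", 0),
        "is_dual_role": name in customers and name in vendors,
    }

print(dual_role_entities(customers, vendors))  # ['Acme Corp']
```

A model trained only on one department's table never sees the `is_dual_role` signal; it emerges only when the data is governed and shared across the organization.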
To learn more about our AI framework models, check out Gyrus AI Framework Modules
It is paramount to have high-quality data, not merely large datasets. Openly available models are mostly developed on open-source or loosely connected datasets. Such datasets usually lack the quality of real-life use cases: they are either incomplete or inaccurate on some counts.
To set up data guidelines across a large organization, a Chief Data Officer (CDO) or equivalent at the company level is desirable. The CDO and their team can formulate best practices and a basic set of rules about the collection, storage, security, usage, and distribution of data. They can also provide tools/software for implementing these prescribed practices.
In practice, each division could have its own data requirements, which could differ from the CDO’s generic goals. Therefore, it is highly desirable to have data engineers at the division level to address the specific needs of the division while still complying with the broader vision set by the CDO.
On the contrary, moving all the data engineers into one group under the CDO could be counterproductive in serving the needs of the divisions. Though that scheme has the potential for higher efficiency (reuse), in the long run the divisions’ needs tend to be ignored in favor of overly broad goals.
To learn more about data governance, check out our data discipline in large organization blog
One of the biggest challenges is that we cannot predict the future uses of data: new models can use data in completely different ways. There are two ways of collecting data, each with its own trade-offs.
a) Collect all events with timestamps
1. This is storage-intensive, as we keep every incoming event.
2. However, it provides the flexibility to develop any future algorithm on this data.
b) Collect processed events
1. With intuition/prior history about what sort of signals are important, the signals are derived and stored.
2. Naturally, this takes less disk space.
A surveillance example of this trade-off: one can store the complete video, collect important signals only when an action takes place in the video, or record only the actions in textual form.
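The two collection strategies above can be sketched as follows; the class names and the "motion" event type are illustrative, not part of any particular product:

```python
import time

class RawEventStore:
    """Strategy (a): keep every event with a timestamp.
    Storage-heavy, but any future algorithm can reprocess the full log."""
    def __init__(self):
        self.events = []
    def record(self, event, ts=None):
        self.events.append((ts if ts is not None else time.time(), event))

class DerivedSignalStore:
    """Strategy (b): store only signals judged important up front.
    Smaller footprint, but future models are limited to these signals."""
    def __init__(self, is_important):
        self.is_important = is_important  # prior knowledge of what matters
        self.signals = []
    def record(self, event, ts=None):
        if self.is_important(event):
            self.signals.append((ts if ts is not None else time.time(), event))

# Surveillance-style example: keep only "motion" events in the derived store.
raw = RawEventStore()
derived = DerivedSignalStore(lambda e: e["type"] == "motion")
for e in [{"type": "idle"}, {"type": "motion"}, {"type": "idle"}]:
    raw.record(e, ts=0)
    derived.record(e, ts=0)
print(len(raw.events), len(derived.signals))  # 3 1
```

The raw store retains all three events for future reprocessing, while the derived store keeps only the one event deemed important today.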
Seeing the value of data, several services and ML/AI model development organizations are willing to work with companies that have data/datasets, with the idea of taking back the learned model. The learned model is, in a way, a representation of the data used to train it, and it can be licensed to other companies in a similar field. This is a huge disadvantage to the organization giving the data, as it would be empowering its competitors with its own data. It is the responsibility of every organization to protect its data and derived models.
Vendor-specific data is proprietary and is governed by non-disclosure agreements with huge ramifications. The models derived from this data can inadvertently leak the underlying information, so transferring models to external parties can violate the confidentiality of the vendor data. Differential privacy is a technique in ML/AI that tests for and prevents leaking individual information through models. Moreover, careful administration of models is essential to avoid leakage of information, especially when models are given out for external usage. In general, quality datasets have become so valuable that there are companies willing to pay for them, either with direct dollars or with cryptocurrency.
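A minimal sketch of one common differential-privacy building block, the Laplace mechanism applied to a count query (function names, records, and the epsilon values are illustrative assumptions, not a production recipe):

```python
import math
import random

def laplace_noise(scale):
    """Draw Laplace(0, scale) noise via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(records, predicate, epsilon=1.0):
    """Answer a count query with Laplace noise added.
    A count query has sensitivity 1, so the noise scale is 1/epsilon."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Illustrative records: smaller epsilon means more noise and more privacy.
records = [{"is_vendor": True}, {"is_vendor": False}, {"is_vendor": True}]
noisy = private_count(records, lambda r: r["is_vendor"], epsilon=0.5)
```

The released answer is close to the true count but randomized, so no single vendor's presence in the dataset can be confidently inferred from the output.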
As a result of data discipline, quality data powers ML/AI algorithms. Companies should have a data policy across the company, and at each department level, to capture, store, protect, and use data. Models derived from this data have the potential to contain vendor/partner data and proprietary information. Therefore, protecting the models is also important, so as not to hand powerful information to competitors.