Part 1. Introduction to DataCleaner
- |--What is data quality (DQ)?
- |--What is data profiling?
- |--What is a datastore?
- Composite datastore
- |--What is data monitoring?
- |--What is master data management (MDM)?
What is data quality (DQ)?
Data Quality (DQ) is a concept and a business term covering the quality of the data used for a particular purpose. The term is most often applied to the quality of data used in business decisions, but it may also refer to the quality of data used in research, campaigns, processes and more.
Working with Data Quality typically varies a lot from project to project, just as the issues in the quality of data vary a lot. Examples of data quality issues include:
- Completeness of data
- Correctness of data
- Duplication of data
- Uniformity/standardization of data
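The issue types listed above can be turned into simple measurements. The following is a minimal sketch in plain Python, not DataCleaner's own API: the sample records, the function names and the allowed country code are all assumptions made up for illustration. Correctness is left out because checking it needs a trusted reference source to compare against.

```python
from collections import Counter

# Hypothetical sample records; None marks a missing value.
records = [
    {"id": 1, "name": "Ann Lee", "country": "DK"},
    {"id": 2, "name": "ann lee ", "country": "dk"},
    {"id": 3, "name": "Bob Roy", "country": None},
    {"id": 4, "name": "Cy Poe", "country": "Denmark"},
]

def completeness(rows, field):
    """Fraction of rows where the field is present and non-empty."""
    filled = sum(1 for r in rows if r.get(field) not in (None, ""))
    return filled / len(rows)

def duplicate_groups(rows, field):
    """Values (after trimming and lower-casing) that occur more than once."""
    counts = Counter(r[field].strip().lower() for r in rows if r.get(field))
    return {value: n for value, n in counts.items() if n > 1}

def non_standard(rows, field, allowed):
    """Values that are present but not written in an allowed standard form."""
    return [r[field] for r in rows
            if r.get(field) is not None and r[field] not in allowed]

print(completeness(records, "country"))         # completeness
print(duplicate_groups(records, "name"))        # duplication
print(non_standard(records, "country", {"DK"})) # standardization
```

Here one of four country values is missing, two name values collapse to the same customer once case and whitespace are ignored, and two country values deviate from the assumed standard code.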
A less technical definition of high-quality data is that data are of high quality "if they are fit for their intended uses in operations, decision making and planning" (J. M. Juran).
Data quality analysis (DQA) is the (human) process of examining the quality of data for a particular process or organization. A DQA includes both technical and non-technical elements. For example, to do a good DQA you will probably need to talk to users, business people, partner organizations and maybe customers. This is needed to assess what the goal of the DQA should be.
From a technical viewpoint the main task in a DQA is the data profiling activity, which will help you discover and measure the current state of affairs in the data.
What is data profiling?
Data profiling is the activity of investigating a datastore to create a 'profile' of it. With a profile of your datastore you will be a lot better equipped to actually use and improve it.
The way you do profiling often depends on whether you already have some ideas about the quality of the data or whether you are new to the datastore at hand. Either way we recommend an explorative approach. Even if you think there are only a certain number of issues you need to look for, it is our experience (and the reasoning behind many of DataCleaner's features) that it is just as important to check those items in the data that you think are correct!
Typically it's cheap to include a bit more data in your analysis, and the results just might surprise you and save you time!
DataCleaner comprises (amongst other aspects) a desktop application for doing data profiling on just about any kind of datastore.
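To make the idea of a 'profile' concrete, here is a small sketch of what profiling a single column might compute. It is written in plain Python and does not use or imitate DataCleaner's API; `pattern_of`, `profile_column` and the phone numbers are hypothetical. It counts rows, nulls and distinct values, and groups values by shape so that unexpected formats stand out.

```python
import re
from collections import Counter

def pattern_of(value):
    """Replace letters with 'A' and digits with '9' to expose the value's shape."""
    shaped = re.sub(r"[A-Za-z]", "A", value)
    return re.sub(r"[0-9]", "9", shaped)

def profile_column(values):
    """Summarise a column: row count, nulls, distinct values and value patterns."""
    present = [v for v in values if v is not None]
    return {
        "rows": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
        "patterns": Counter(pattern_of(v) for v in present),
    }

# Hypothetical column of phone numbers, including a missing value.
phone_numbers = ["555-1234", "555-9876", None, "5551234", "555-1234"]
profile = profile_column(phone_numbers)

for key, value in profile.items():
    print(key, value)
```

The pattern counts illustrate the explorative point made above: even a column assumed to be clean can turn out to hold the same kind of value in more than one format.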
What is a datastore?
A datastore is the place where data is stored. Usually enterprise data lives in relational databases, but there are numerous exceptions to that rule.
To encompass different sources of data, such as databases, spreadsheets, XML files and even standard business applications, we employ the umbrella term datastore.
DataCleaner is capable of retrieving data from a very wide range of datastores. Furthermore, DataCleaner can update the data of most of these datastores as well.
A datastore can be created in the UI or via the configuration file. You can create a datastore from many types of source, such as CSV, Excel, Oracle Database, MySQL, etc.
Composite datastore
A composite datastore contains multiple datastores. The main advantage of a composite datastore is that it allows you to analyze and process data from multiple sources in the same job.
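The datastore and composite datastore concepts can be sketched as a common interface over different sources. The code below is only an illustration of the idea in plain Python: the classes `CsvDatastore`, `SqliteDatastore` and `CompositeDatastore`, their `rows()` method and the sample data are assumptions, not how DataCleaner defines or configures datastores.

```python
import csv
import io
import sqlite3

class CsvDatastore:
    """Hypothetical datastore backed by CSV text."""
    def __init__(self, name, text):
        self.name = name
        self._text = text

    def rows(self):
        return list(csv.DictReader(io.StringIO(self._text)))

class SqliteDatastore:
    """Hypothetical datastore backed by one table in a SQLite connection."""
    def __init__(self, name, connection, table):
        self.name = name
        self._connection = connection
        self._table = table  # assumed to be a trusted, fixed table name

    def rows(self):
        cursor = self._connection.execute(f"SELECT * FROM {self._table}")
        columns = [d[0] for d in cursor.description]
        return [dict(zip(columns, row)) for row in cursor]

class CompositeDatastore:
    """Presents several datastores as one, tagging each row with its source."""
    def __init__(self, name, members):
        self.name = name
        self._members = members

    def rows(self):
        return [{"source": m.name, **row}
                for m in self._members for row in m.rows()]

# Two unrelated sources holding customer data.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
connection.execute("INSERT INTO customers VALUES (1, 'Ann Lee')")

crm = CsvDatastore("crm_export", "id,name\n7,Bob Roy\n")
erp = SqliteDatastore("erp_db", connection, "customers")
combined = CompositeDatastore("all_customers", [crm, erp])
all_rows = combined.rows()
print(all_rows)
```

A single job can now iterate over `combined.rows()` without caring where each row came from. Note that the CSV source yields the id as text while the database yields an integer, which is exactly the kind of inconsistency that surfaces when sources are combined.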
What is data monitoring?
We've argued that data profiling is ideally an explorative activity. Data monitoring typically isn't! The measurements that you make when profiling often need to be checked continuously so that your improvements are enforced over time. This is what data monitoring is typically about.
Data monitoring solutions come in different shapes and sizes. You can set up your own batch of scheduled jobs that run every night. You can build alerts around them that send you emails if a particular measure goes beyond its allowed thresholds. In some cases you can attempt to rule out the issue entirely by applying First-Time-Right (FTR) principles that validate data at entry time, e.g. in data registration forms.
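The threshold-based alerting described above might look roughly like the sketch below. It is a generic Python illustration under assumed names (`Threshold`, `check_metrics`, the metric names and limits are all hypothetical), not the DataCleaner monitor's mechanism. Sending the email is left out; the function only decides which alerts a nightly run would raise.

```python
from dataclasses import dataclass

@dataclass
class Threshold:
    metric: str
    minimum: float
    maximum: float

def check_metrics(measured, thresholds):
    """Return one alert message per metric that is missing or out of range."""
    alerts = []
    for t in thresholds:
        value = measured.get(t.metric)
        if value is None:
            alerts.append(f"{t.metric}: no measurement")
        elif not t.minimum <= value <= t.maximum:
            alerts.append(f"{t.metric}: {value} outside [{t.minimum}, {t.maximum}]")
    return alerts

# Hypothetical allowed ranges and the results of one nightly job.
thresholds = [
    Threshold("email_completeness", 0.95, 1.0),
    Threshold("duplicate_customers", 0, 10),
]
nightly_run = {"email_completeness": 0.91, "duplicate_customers": 4}

alerts = check_metrics(nightly_run, thresholds)
for message in alerts:
    print(message)
```

In this run only the completeness metric falls outside its allowed range, so a scheduler wrapping this check would send a single alert.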
As of version 3, DataCleaner also includes a monitoring web application, dubbed "DataCleaner monitor". The monitor is a server application that supports orchestrating and scheduling of jobs, as well as exposing metrics through web services and through interactive timelines and reports. It also supports the configuration and job-building process through wizards and management pages for all the components of the solution. As such, we like to say that the DataCleaner monitor provides a good foundation for the infrastructure needed in a Master Data Management hub.
What is master data management (MDM)?
Master data management (MDM) is a very broad term and materializes in a variety of ways. Within the scope of this document it serves more as a context for data quality than as an activity that we actually target with DataCleaner per se.
The overall goal of MDM is to manage the important data of an organization. By "master data" we refer to "a single version of the truth", i.e. not the data of a particular system, but for example all the customer data or product data of a company. Usually this data is dispersed over multiple datastores, so an important part of MDM is the process of unifying the data into a single model.
Obviously, another very important issue to handle in MDM is the quality of data. If you simply gather, e.g., "all customer data" from all systems in an organization, you will most likely see a lot of data quality issues. There will be a lot of duplicate entries, there will be variances in the way that customer data is filled in, and there will be different identifiers and even different levels of granularity for defining "what is a customer?". In the context of MDM, DataCleaner can serve as the engine to cleanse, transform and unify data from multiple datastores into the single view of the master data.
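As a rough illustration of unifying customer data into a single view, the sketch below merges records from two systems into one record per customer. It is plain Python under simplifying assumptions: the `unify` and `normalise_email` functions and the sample systems are hypothetical, matching is done on a normalised email address only, and the first non-empty value wins. Real MDM matching and survivorship rules are considerably richer.

```python
def normalise_email(email):
    """Trim whitespace and lower-case so the same address always matches."""
    return email.strip().lower()

def unify(sources):
    """Merge records from several systems into one record per normalised email.

    Earlier sources win when two systems disagree on a field.
    """
    master = {}
    for system, rows in sources.items():
        for row in rows:
            key = normalise_email(row["email"])
            record = master.setdefault(key, {"email": key, "systems": []})
            record["systems"].append(system)
            for field, value in row.items():
                if field != "email" and value:
                    # Keep the first non-empty value seen for each field.
                    record.setdefault(field, value)
    return master

# Hypothetical customer data held by two systems.
sources = {
    "crm": [{"email": "Ann@Example.com ", "name": "Ann Lee", "phone": ""}],
    "billing": [
        {"email": "ann@example.com", "name": "A. Lee", "phone": "555-1234"},
        {"email": "bob@example.com", "name": "Bob Roy", "phone": ""},
    ],
}

master = unify(sources)
for record in master.values():
    print(record)
```

The two spellings of Ann's email address resolve to one master record that combines the name from the CRM system with the phone number from billing, while Bob, known only to billing, gets a record without a phone number.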