Data Cleansing and Warehouses with XML
Deficiencies in the quality of information are considered one of the most pressing problems in enterprise-wide IT operations. Recent studies from IDC analysts reveal that 20% of all data is of insufficient quality, impeding the associated business processes. Common causes of dirty data are wrong entries caused by typos, incompatible variable domains, missing entries, and the absence of enterprise-wide coding conventions or international standards. Data cleansing comprises a number of methods that address the analysis and cleansing of inconsistent data in order to achieve a homogeneous data pool.
A survey of 1648 companies by IDC analysts has revealed that data cleansing and data quality are considered the second most urgent IT problem in 2003, right behind budget cuts. An analysis by the Cutter Consortium shows that 77% of all companies use in-house developments for data cleansing processes. These are usually application programs tailored to the proprietary formats of the dirty data; as a result, they cause significant maintenance overhead, are not future-proof, and are unreasonably expensive.
The eXtensible Markup Language (XML) opens new perspectives to simplify data cleansing processes.
The expressive power of XML allows relational data, EDI formats, and similar sources to be represented directly in a uniform syntax without information loss, and then cleansed, enriched, and combined over multiple stages. Each step constitutes a logical unit that is developed as a self-contained transformation and maintained independently. The product Infonyte DB, with its persistent processing architecture, is designed for complex data cleansing processes involving multiple steps.
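The idea of independent, self-contained steps can be illustrated with a minimal sketch. It uses only the Python standard library and is not Infonyte DB code; the stage functions, the `run_pipeline` helper, and the sample record are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET

def strip_whitespace(root):
    """Stage 1: trim leading and trailing whitespace in leaf elements."""
    for el in root.iter():
        if el.text is not None and len(el) == 0:
            el.text = el.text.strip()
    return root

def uppercase_country(root):
    """Stage 2: normalize country codes to upper case."""
    for el in root.iter("country"):
        if el.text:
            el.text = el.text.upper()
    return root

def run_pipeline(root, stages):
    """Apply each stage in order; every stage takes and returns an XML tree."""
    for stage in stages:
        root = stage(root)
    return root

# Hypothetical sample record, not taken from any real data source.
RAW = (
    "<customers><customer>"
    "<name> Alice Smith </name><country> de</country>"
    "</customer></customers>"
)

cleaned = run_pipeline(ET.fromstring(RAW), [strip_whitespace, uppercase_country])
```

Because every stage shares the same signature, a stage can be replaced, reordered, or re-used in another pipeline without touching the others.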
In a first step, the source data is converted into a straightforward XML format. This "XMLization" should be lossless, making the structure explicit while preserving the content of the source data. Infonyte comes with XML-optimized indexing and storage capabilities with very fast bulk load and index creation, and the ability to process arbitrary schemaless data without loss of performance. This results in reduced development effort for data conversions, and reduced maintenance effort for the implementation of structural changes. In the following steps of the cleansing process, Infonyte DB can act as an XML warehouse decoupled from the operational business processes.
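As a rough illustration of such an "XMLization" step, the following sketch converts CSV text into XML with the Python standard library. The element names (`table`, `row`, `field`), the `xmlize_csv` function, and the sample data are assumptions made for this example; the target format actually used with Infonyte DB is not specified here.

```python
import csv
import io
import xml.etree.ElementTree as ET

def xmlize_csv(text, root_tag="table", row_tag="row"):
    """Convert CSV text into an XML tree without altering any field value."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    root = ET.Element(root_tag)
    for line_no, fields in enumerate(reader, start=1):
        row = ET.SubElement(root, row_tag, {"n": str(line_no)})
        for i, value in enumerate(fields):
            # Column names are stored in an attribute, so headers that are
            # not valid XML element names still survive the conversion.
            name = header[i] if i < len(header) else "col%d" % (i + 1)
            field = ET.SubElement(row, "field", {"name": name})
            field.text = value  # kept verbatim, including stray whitespace
    return root

# Hypothetical source data with typical defects: a leading blank, a missing entry.
SOURCE = "id,name,country\n1, Alice Smith,DE\n2,Bob Jones,\n"

doc = xmlize_csv(SOURCE)
xml_text = ET.tostring(doc, encoding="unicode")
```

The conversion deliberately repairs nothing: the stray blank and the empty country are carried over unchanged, so that every correction happens in a later, separately maintained cleansing step.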
Subsequent cleansing tasks such as format conversions, detection and elimination of duplicates, or referential integrity checks are supported by Infonyte's persistent XSLT processor. In contrast to XSLT solutions that operate in main memory, it places no limit on the data volume that can be processed. Complex join operations (grouping, process tracing) that require multiple scans of the data are accelerated through specific index structures.
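Duplicate elimination, one of the tasks named above, can be sketched as follows. This is plain Python rather than XSLT, and it holds the whole document in main memory, so it illustrates only the logic and not the persistent processing described in the text. The element names, the `normalize` and `drop_duplicates` helpers, and the matching rule (case and whitespace are ignored) are assumptions.

```python
import xml.etree.ElementTree as ET

def normalize(text):
    """Collapse whitespace and case so trivially different spellings compare equal."""
    return " ".join((text or "").split()).lower()

def drop_duplicates(root, record_tag, key_tags):
    """Remove later records whose normalized key fields match an earlier record."""
    seen = set()
    removed = 0
    for record in list(root.findall(record_tag)):
        key = tuple(normalize(record.findtext(tag)) for tag in key_tags)
        if key in seen:
            root.remove(record)
            removed += 1
        else:
            seen.add(key)
    return removed

# Hypothetical sample data: the second record duplicates the first.
CUSTOMERS = """<customers>
  <customer><name>Alice Smith</name><city>Berlin</city></customer>
  <customer><name>alice  smith </name><city>BERLIN</city></customer>
  <customer><name>Bob Jones</name><city>Hamburg</city></customer>
</customers>"""

customers = ET.fromstring(CUSTOMERS)
removed_count = drop_duplicates(customers, "customer", ["name", "city"])
remaining = [c.findtext("name") for c in customers.findall("customer")]
```

The first occurrence of each key is kept. A production rule would usually need fuzzier matching, for example tolerance for typos, which this sketch does not attempt.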
After the data of each individual source has been prepared, cleansing steps that involve multiple sources, such as the completion of reference structures among sources, can be performed. Likewise, Infonyte DB supports the integration of data from different sources in order to enable complex migration processes. With extensions of the existing XML query and transformation standards, very complex cleansing and alignment operations such as groupings can be realized that would otherwise require heavyweight OLAP-enabled systems.
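A cross-source step of this kind might look like the following sketch, which checks references from one source against another and groups the valid ones. It uses only the Python standard library; the two documents, their attribute names, and the `group_orders_by_customer` function are hypothetical and do not reflect an Infonyte DB interface.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical data from two separately prepared sources.
CUSTOMERS = """<customers>
  <customer id="c1"><name>Alice Smith</name></customer>
  <customer id="c2"><name>Bob Jones</name></customer>
</customers>"""

ORDERS = """<orders>
  <order id="o1" customer="c1"/>
  <order id="o2" customer="c3"/>
  <order id="o3" customer="c1"/>
</orders>"""

def group_orders_by_customer(customers, orders):
    """Group order ids by customer id and report orders with unknown customers."""
    known = {c.get("id") for c in customers.findall("customer")}
    grouped = defaultdict(list)
    dangling = []
    for order in orders.findall("order"):
        ref = order.get("customer")
        if ref in known:
            grouped[ref].append(order.get("id"))
        else:
            dangling.append(order.get("id"))  # reference cannot be resolved
    return dict(grouped), dangling

grouped, dangling = group_orders_by_customer(
    ET.fromstring(CUSTOMERS), ET.fromstring(ORDERS)
)
```

Here order `o2` points at a customer that does not exist in the first source, so it is reported as dangling instead of being silently dropped or grouped.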
In contrast to the commonly developed "all in one step" programs for data cleansing, the Infonyte approach reduces development costs through re-use of the individual cleansing steps, and reduces maintenance costs through modularity and the use of open standards.
Due to its modularity, its platform independence (100% Java), and its web service interface, Infonyte DB integrates seamlessly with existing IT architectures. It supports the prevalent standards for XML processing in a scalable way and thus constitutes a future-proof solution for the creation of complex data cleansing processes.
Contact:
info@infonyte.com