Market Guide for Self-Service Data Preparation

Discussion in 'ERP, CRM, EPM and BI Solutions' started by bsdinsight, Nov 6, 2016.

  1. bsdinsight

    bsdinsight Well-Known Member

    This report profiles 36 self-service data preparation products used by analysts and data scientists to accelerate data preparation for analysis, and increasingly by data engineers in data and analytics teams to create trusted, agile, curated data for a range of distributed analytics content authors.


    Key Findings
    • The trend toward ease of use and agility that has disrupted the BI and analytics and advanced analytics markets is also occurring for data integration for analytics.
    • Most vendor offerings support broad data management capabilities, including interactive data preparation; data exploration, transformation, modeling and curation; and metadata support. Some also offer cataloging, enrichment and intelligent capabilities.
    • The market is crowded with a range of vendor choices, from stand-alone specialists to vendors that embed these tools into BI and analytics, data science and/or data integration platforms.
    • Although these tools accelerate the shift toward broadly deployed modern, agile BI and advanced analytics, they can introduce multiple versions of the truth if left unchecked.
    Data and analytics leaders should:

    • Develop a deployment strategy for self-service data preparation to enhance user understanding of data and reduce data preparation efforts. Evaluate all vendors and choose the most appropriate offering based on capabilities, integration points, pricing and roadmaps.
    • Consider creating a new role — data engineer — to support the rapid, agile creation of curated, trusted datasets for distributed analytics content authors to promote reusable artifacts that enable bimodal and citizen integrator strategies.
    • Establish guidelines and processes for when business users should transform data sources for exploration (Mode 2) and/or when the task should be promoted to centralized IT teams for formal data integration (Mode 1).
    • Create a process and guidelines for vetting and reusing models developed by business users and incorporating them into the enterprise data integration workflow. Recognize that self-service data preparation tools will not yet replace the need for robust data integration/extraction, transformation and loading solutions for all requirements.
    Strategic Planning Assumptions

    By 2020, self-service data preparation tools will be used in more than 50% of new data integration efforts for analytics.

    By 2019, data and analytics organizations that provide agile, curated internal and external datasets for a range of content authors will realize twice the business benefits of those that do not.

    By 2019, more than 80% of stand-alone data integration, data discovery and advanced analytics vendors will expand their capabilities portfolio to include self-service data preparation capabilities for a range of users.

    By 2020, 85% of the new BI platform spend by large and midsize organizations globally will be on modern BI platforms.

    Market Definition

    This document was revised on 30 August 2016. The document you are viewing is the corrected version. For more information, see the Corrections page on

    Analytics users spend the majority of their time either preparing their data for analysis or waiting for data to be prepared for them. Self-service data preparation is an iterative agile process for exploring, combining, cleaning and transforming raw data into curated datasets for data science, data discovery, and BI and analytics (see Figure 1).

    While a smaller segment of overall use, self-service data preparation is also being used for cleansing data in a data migration effort.

    Figure 1. Overview of Self-Service Data Preparation


    Source: Gartner (August 2016)
    Last edited: Nov 6, 2016

    Business demands for faster and deeper insights from a broader range of data sources have driven the rapid growth of modern BI and analytics (BI&A) platforms, including more-pervasive self-service, and expanded adoption of advanced analytics platforms used by both specialist and citizen data scientists. However, as more and more users want to rapidly analyze complex combinations of data (for example, internal transaction and clickstream data, with weather or other open data or premium datasets), existing data integration approaches are often too rigid and time-consuming to keep up with demand. The escalating challenges associated with larger and more-diverse data are further contributing to the demand for adaptive and easy-to-use data preparation tools. Some of these challenges include:
    • Inability to derive value from data due to a lack of understanding of the data.
    • A lack of trust in the data due to data quality issues and shortage of necessary metadata.
    • The variety of data formats, which impedes timely exploration.
    • Inability to share data due to the potential risk of identifying personal and sensitive information buried in the data assets.
    • An increasing amount of data from unknown origin.
    • More data coming from a broader set of sources inside and outside of the organization, including open data and third-party sources from data brokers.
    • The greater challenge of governance of trusted, accurate results, as distributed analytics content authoring by analysts, data scientists and citizen data scientists expands in the enterprise.
    • The need to quickly prepare and explore data and garner operational insights, which has become a key imperative as organizations accelerate their plans to become more agile and flexible.
    These challenges have made data preparation one of the biggest roadblocks to pervasive and trusted modern analytics.

    Self-service data preparation addresses these challenges by accelerating the way users find, access, clean, combine, model and transform, and collaborate with data in an agile yet trusted way. There are three emerging workflows and use cases for self-service data preparation (see Figure 2):

    1. In analytics workflow: Business analysts and data scientists use self-service data preparation capabilities, embedded in modern BI&A and advanced analytics platforms, to iteratively access, combine, prepare and manipulate data as well as collaborate with others as part of the analytic process. They often access and blend unmodeled enterprise, cloud, data lake and personal data, including flat files. Alternatively, they combine it with IT-provisioned system-of-record data in data warehouses and data marts, prepared by existing data integration approaches.

    2. Extend analytics workflow: Analysts and data scientists leverage specialist, stand-alone platforms when data integration and harmonization requirements exceed the capabilities of tools embedded in analytics platforms. In this case, the self-service data preparation tool is target-agnostic, but often includes the modern BI&A and data science platform or a data lake. Many data scientists also supplement embedded capabilities with multiple open-source tools, such as Python (pandas) and R (data frame), or tools that come with Hadoop distributions (such as Cloudera) for different parts of the data pipeline process, such as data staging and integrating data into data lakes. However, these tools require coding and expertise. Stand-alone self-service data preparation tools can be used to accelerate and streamline the entire process in a single tool.

    3. Enable curated data: Self-service data preparation tools have initially been used primarily by analysts and data scientists to expedite the preparation of data as part of iterative analysis. However, the role of data engineer, a centralized data specialist, is emerging in IT, and is using these tools to speed the creation of curated, trusted data for a range of distributed analytics content authors. As self-service and advanced analytics become more pervasive, data and analytics leaders have an opportunity to shift and expand their teams' role from being BI content creators and data access controllers to user and data enablers with this new data engineer role. Stand-alone self-service data preparation tools that integrate with a range of front-end tools and data sources can accelerate how quickly data engineers in centralized teams and analytics competency centers (ACEs) can provide curated datasets for business analysts, data scientists or BI developers in IT. This approach also requires a process and guidelines for how business analysts and data scientists can "promote" or operationalize findings and models into trusted, recurring analysis, either themselves or via formal data integration practices. In cases where data sources are stable and won't frequently change, user-generated content can be promoted or implemented using system-of-record, existing data integration approaches.
    As distributed content authoring and self-service data preparation proliferate, it is important that data lineage and metadata are accessible to content consumers, to foster trust and auditability (see "Certify Data to Foster Trust and Consistent Use" and "Embrace Self-Service Data Preparation Tools for Agility, but Govern to Avoid Data Chaos" ). Self-service data preparation tools can help address this requirement.
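
    As a rough illustration of the coded route mentioned in workflow 2, and of why recorded lineage matters, here is a minimal sketch in Python. It uses only the standard library rather than pandas so that it runs anywhere. The sample rows, the field names and the helper functions (clean, blend, total_by) are all hypothetical and chosen for illustration; they are not taken from the report or from any vendor product.

```python
from collections import defaultdict

# Hypothetical sample data: transactions from a flat file, plus open weather data.
transactions = [
    {"date": "2016-08-01", "store": "A", "amount": "120.50"},
    {"date": "2016-08-01", "store": "B", "amount": ""},  # missing amount
    {"date": "2016-08-02", "store": "A", "amount": "80.00"},
]
weather = [
    {"date": "2016-08-01", "condition": "rain"},
    {"date": "2016-08-02", "condition": "sun"},
]

lineage = []  # ordered record of every preparation step applied


def clean(rows):
    """Drop rows with a missing amount and cast amount to float."""
    kept = [dict(r, amount=float(r["amount"])) for r in rows if r["amount"].strip()]
    lineage.append(f"clean: kept {len(kept)} of {len(rows)} rows")
    return kept


def blend(rows, lookup_rows, key):
    """Left-join the lookup attributes onto each row using a shared key."""
    lookup = {r[key]: r for r in lookup_rows}
    blended = []
    for r in rows:
        extra = {k: v for k, v in lookup.get(r[key], {}).items() if k != key}
        blended.append({**r, **extra})
    lineage.append(f"blend: joined on '{key}'")
    return blended


def total_by(rows, field):
    """Sum the amount column per distinct value of the given field."""
    totals = defaultdict(float)
    for r in rows:
        totals[r[field]] += r["amount"]
    lineage.append(f"aggregate: sum(amount) by '{field}'")
    return dict(totals)


curated = total_by(blend(clean(transactions), weather, "date"), "condition")
print(curated)
for step in lineage:
    print(step)
```

    Even in this toy form, the lineage list gives a content consumer a way to see how the curated figures were derived. A self-service tool would typically capture that record automatically instead of relying on the author to log each step.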

    Data and analytics leaders must also put a process in place to:
    • Identify and certify trusted data sources
    • Formalize roles and responsibilities between users and IT for certifying, using, sharing and promoting of data
    • Establish quality metrics
    • Watermark outputs
    Figure 2. Self-Service Data Preparation Workflows

    BI&A = business intelligence and analytics

    Source: Gartner (August 2016)

    Just as modern BI platforms share capabilities with traditional BI and reporting platforms, self-service data preparation tools share many features and functions with existing data integration tools. But they differ significantly in who can use them, the user experience, how data is modeled and integrated, the types of data supported, and the agility for change (see Figure 3).

    Figure 3. Differences Between Existing Data Integration and Self-Service Data Preparation

    Source: Gartner (August 2016)

    Self-service data preparation tools offer 10 major capabilities, listed below. Vendors generally support most of these high-level, core data preparation capabilities, or might specialize in specific ones such as cataloging. All vendors profiled in this Market Guide support eight or more of these capabilities. Please see "Toolkit: Self-Service Data Preparation," which provides a custom-filtered list of capabilities offered by self-service data preparation vendors. Business analytics leaders will be able to shortlist vendors based on deployment models, target roles, supported capabilities, natively supported data sources and geographic region.

    1. Data exploration and profiling: A visual environment that enables users to interactively prepare, search, sample, profile, catalog and inventory data assets, as well as tag and annotate data for future exploration. Advanced features include autoinference, discovering and suggesting sensitive attributes, identifying commonly used attributes (for example, geodata, product ID), doing semantic reconciliation, discovering and recording data lineage of transformations, and autorecommending sources to enrich the data.

    2. Collaboration: Facilitates the sharing of queries and datasets, including publishing, sharing and promoting models with governance features, such as dataset user ratings or official watermarking.

    3. Data transformation, blending and modeling: Supports data enrichment, data mashup and blending, data cleansing, filtering, and user-defined calculations, groups and hierarchies. This includes agile data modeling/structuring that allows users to specify data types and relationships. More-advanced capabilities automatically deduce or infer the structure from the data source, and generate semantic models and ontologies, such as logical data models and Hive schemas.

    4. Data curation and governance: Supports workflow for data stewardship and capabilities for data encryption, user permissions and data lineage. This also includes security features that enable governance, such as data masking, platform authentication and security filtering at the user/group/role level, as well as through integration with corporate LDAP and/or Active Directory systems, SSO, source system security inheritance, row- and column-level security, and logging and monitoring of data usage and assets.

    5. Metadata repository and cataloging: Supports creating and searching metadata, cataloging of data sources, transformations, user activity against the data source, data source attributes, data lineage and relationships, and APIs to enable access to the metadata catalog for auditing or other uses. Through the use of analytics on the raw data, the models are derived and generated bottom up instead of designed top down. It is a continuous process of accumulating metadata based on the actual use of data. It is a living construct. This is the key difference from the ontologies and enterprise data models of the 1980s and 1990s, which were too holistic and complex to centrally design upfront. This is also a difference from data warehouse automation tools, such as Kalido and WhereScape, which automate the data warehouse development process and life cycle.

    6. Machine learning: Use of machine learning and artificial intelligence (AI) to automate and improve the self-service data preparation process.

    7. Deployment models: Platforms can be deployed either in the cloud, on-premises, or across both cloud and on-premises. This latter hybrid approach allows users to leave data on-premises in place for processing, rather than moving it to the self-service data preparation platform either in the cloud or on-premises.

    8. Domain- or vertical-specific offerings or templates: Packaged templates or offerings for domain- or vertical-specific data and models that can further accelerate time to data preparation and insight. This is particularly helpful for a number of difficult-to-use syndicated datasets.

    9. Data source access and connectivity: APIs and standards-based connectivity, including native access to cloud application and data sources, enterprise on-premises data sources, relational and unstructured data, NoSQL, Hadoop, and various file formats (XML, JSON, .csv), as well as native access to open, premium or curated data.

    10. Integration with BI&A and advanced analytics platforms: The ability to integrate harmonized, curated datasets with BI&A and advanced analytics platforms through APIs or native support for partner file formats (for example, .tde for Tableau Software, .qvd for Qlik and .pbix for Microsoft Power BI).
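
    To make capability 1 (data exploration and profiling) more concrete, the following is a minimal sketch of column profiling with simple type inference and a naive check for possibly sensitive attributes. It is written in standard-library Python; the function names, the sample rows and the email-pattern heuristic are assumptions made for illustration and do not describe how any profiled product works.

```python
import re

# Naive pattern used as a stand-in for sensitive-attribute detection (assumption).
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def infer_type(values):
    """Return the narrowest type that fits every non-empty value."""
    def all_cast(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False

    if not values:
        return "empty"
    if all_cast(int):
        return "integer"
    if all_cast(float):
        return "float"
    return "string"


def profile(rows):
    """Summarize each column: inferred type, missing count, distinct count, sensitivity hint."""
    columns = rows[0].keys() if rows else []
    report = {}
    for col in columns:
        raw = [r.get(col, "") for r in rows]
        present = [v for v in raw if v != ""]
        report[col] = {
            "type": infer_type(present),
            "missing": len(raw) - len(present),
            "distinct": len(set(present)),
            "possibly_sensitive": bool(present) and all(EMAIL.match(v) for v in present),
        }
    return report


# Hypothetical sample rows, as they might arrive from a .csv file (all strings).
sample = [
    {"product_id": "101", "price": "9.99", "contact": "ann@example.com"},
    {"product_id": "102", "price": "", "contact": "bob@example.com"},
    {"product_id": "101", "price": "12", "contact": "cy@example.com"},
]
report = profile(sample)
for column, stats in report.items():
    print(column, stats)
```

    Commercial tools go well beyond this, for example by sampling large sources and recognizing many more semantic types, but the output shape (one summary per column) is the kind of information a profiling view surfaces to the user.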

    While self-service data preparation can be used for an increasing number of data integration needs for modern BI&A deployments, it does not eliminate the need for existing data integration approaches. Figure 4 highlights a decision process for when to use self-service data preparation tools versus when to use existing integration tools, or when to start with self-service data preparation then promote models to existing data integration processes. Self-service data preparation is more suited for ad hoc, iterative analyses if time to insight is more critical than upfront trust and data governance, and if the underlying data source is a complex combination of many sources that may frequently change. If the underlying data sources for a recurring analysis are relatively stable, blended data and models created with self-service data preparation tools can be promoted or rebuilt into existing data integration flows, if trust is critical. Where data governance and trust are more critical than time to insight, such as in regulatory or financial reporting or for reporting involving customer-sensitive data, existing data integration may be preferred from the start.
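
    The reasoning in the paragraph above can be sketched as a small decision function. This is a simplification built only from the prose; the figure itself is not reproduced in this thread, so the actual decision process may have more branches. The class name, field names and return strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AnalysisNeed:
    governance_over_speed: bool  # e.g. regulatory, financial or customer-sensitive reporting
    recurring: bool              # the analysis will be repeated, not ad hoc
    sources_stable: bool         # underlying data sources rarely change
    trust_critical: bool         # results must be trusted once operationalized


def recommend(need: AnalysisNeed) -> str:
    """Map an analysis need to an approach, following the prose description only."""
    if need.governance_over_speed:
        # Governance and trust outweigh time to insight from the start.
        return "existing data integration"
    if need.recurring and need.sources_stable and need.trust_critical:
        # Explore with self-service tools, then promote or rebuild the models.
        return "self-service first, then promote to existing data integration"
    # Ad hoc, iterative work where time to insight matters most.
    return "self-service data preparation"


print(recommend(AnalysisNeed(True, True, True, True)))
print(recommend(AnalysisNeed(False, True, True, True)))
print(recommend(AnalysisNeed(False, False, False, False)))
```

    Checking governance first reflects the statement that existing data integration may be preferred from the start when trust outweighs time to insight; the other two branches cover the promote-later and purely exploratory cases.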

    Figure 4. When to Use Self-Service Data Preparation

    Source: Gartner (August 2016)

    Market Direction

    Self-service data preparation capabilities are evolving rapidly. The success of this category is driven in large part by the success of organizations' transition to modern BI platforms and more-pervasive advanced analytics. The category also supports this drive.

    Over the next two to five years, the self-service data preparation market will mirror the shifts to business-user-oriented and agile platforms in the BI&A market. Modern BI platforms that were largely driven by the success of data discovery tools such as Tableau and Qlik initially complemented traditional BI system-of-record reporting platforms when business users demanded agile and easy-to-use tools. However, as user success with these tools spread and the tools became more enterprise-ready, more and more new analytics requirements (system-of-record, IT-produced analytics, for example) have been implemented using modern BI&A tools, because of their agility. This has resulted in higher business value for users. By 2020, 85% of new BI platform spend by large and midsize organizations globally will be on modern BI platforms that allow all users, including IT, to rapidly build analytics content to meet the demanding time to insight and changing data needs of users. Until now, data preparation has been a major roadblock to achieving agile and pervasive analytics. Similar to the increasing use of modern BI because of its ease of use and agility, an increasing percentage of new data integration initiatives for analytics will be implemented using agile, self-service data preparation versus existing data integration approaches, particularly as these capabilities mature and become more enterprise capable.

    The market has a large number of small, venture-funded startups that started out as Mode 2 and are now expanding to Mode 1. It also has many established data management integration vendors, including IBM, Informatica, Oracle, Pentaho, SAP and Talend. These have major product initiatives around supplementing existing integration products with stand-alone, agile and, in most cases, smart products that support an expanding range of data types, roles and use cases for Mode 2.

    By 2020, self-service data preparation tools will be used in more than 50% of new data management efforts for analytics, because of their agility.

    Key product trends will include:
    • Making these tools more enterprise capable in terms of scale, data and analytics tool support, governance, reuse, security, administration, and lineage.
    • Making the data integration, harmonization and enrichment process more intelligent, automated, seamless and prescriptive to users, based on their context.
    • Making indexing, finding, sharing, relating and collaborating on datasets easier by making cataloging a standard feature, creating an internal data market for internal and external datasets.
    • Support for more-diverse, larger data sources and data source combinations, including cloud and on-premises.
    • Integrating access to data services that offer curated, aggregated and monetized data for enrichment.
    • The convergence of information governance across the analytical domain (to date this takes place in the operational domain), where self-service data preparation tools will integrate with either or both MDM and information stewardship applications by ingesting the policies as set by these tools, and then publish identified violations or compliance data.
    • Convergence of capabilities with smart data discovery, smart data lakes, cataloging and next-generation data integration platforms that also support agile self-service for streaming data. A number of startup vendors such as Striim and StreamSets are pioneering this trend.

    Market Analysis

    Note: The features and functions represented in this analysis are based on functionality presented to Gartner as of 1 June 2016. Agile release cycles by most vendors could result in near-term updates.

    The self-service data preparation market is expanding and crowded with a range of vendor choices, from stand-alone specialists to vendors that embed self-service data preparation with BI&A, advanced analytics and/or traditional data integration platforms. It will likely consolidate over the next two to four years.

    This Market Guide highlights self-service data preparation vendors that are segmented into the following categories:

    • Stand-alone self-service data preparation. Vendors in this category sell self-service data preparation stand-alone. Stand-alone vendors that specialize in data cataloging are included and highlighted, as cataloging is a key feature trend. Stand-alone vendor offerings focus on enabling tighter integration with downstream processes, such as API access and support for multiple BI tools.

    • Integrated with existing data integration platforms. Vendors here are existing data integration vendors that have added self-service data preparation to their product portfolios. They often offer some level of integration and promotability of models and data between the self-service data preparation and existing data integration tools.

    • Integrated with modern BI&A or advanced analytics/data science platforms. These integrated data preparation vendor offerings focus on data preparation capabilities as part of an end-to-end analytic process and offering, with broader BI&A, data science model and content creation capabilities.
    Data preparation is becoming critical for analytics initiatives targeting self-service analytics and self-service data discovery, which are identified in this research as the most commonly occurring use cases. Except for existing data integration vendors (for example, Informatica, Oracle and IBM), most self-service data preparation vendors focus on analytics data integration as opposed to broader data movement and MDM.

    It is important to note that basic data preparation capabilities have always been an integral part of data science/advanced analytics product offerings. These offerings have grown to standardize such capabilities and provide a self-service-oriented approach.

    Vendors that are not profiled in this Market Guide but nonetheless have or incorporate data preparation capabilities include:

    • Other stand-alone vendors such as Reltio, which offers a unique visual environment and end-to-end, self-service MDM capabilities to visually create data profiles and graph relationships with lineage; and Smartlogic, which offers semantic discovery used for cataloging.

    • Advanced analytics vendors such as SAS (SAS Enterprise Miner), RapidMiner (RapidMiner Studio), Konstanz Information Miner (KNIME — KNIME Analytics Platform) and Alpine Labs, represented in "Magic Quadrant for Advanced Analytics Platforms."

    • Smart data discovery vendors beyond IBM Watson Analytics, such as BeyondCore, DataRPM and SparkBeyond, which include intelligent, self-service data preparation as an integrated part of the intelligent and automated analytics workflow from data preparation to pattern exploration to narration of findings.

    • BI and analytics vendors on the "Magic Quadrant for Business Intelligence and Analytics Platforms" that have established but maturing embedded self-service data preparation, such as Birst, Board, Pyramid Analytics, Sisense and SAS (Enterprise Guide as well as Office Analytics).

    • BI&A vendors on the "Magic Quadrant for Business Intelligence and Analytics Platforms" that are enhancing embedded, nascent data preparation capabilities in future releases, such as Domo, GoodData, Information Builders, SAS (Visual Analytics), Salesforce (Wave Analytics) and Yellowfin.

    • Vendors that specialize in self-service data preparation for streaming data, such as Striim and StreamSets.
    The profiled vendor offerings reflect a range of maturity levels. Some of them are new in the market, released within the last 12 to 18 months and still in the process of stabilizing. Expect these offerings, focused on big data discovery and analytics, to evolve as more features and functionalities are introduced.

    It is essential to evaluate and choose the most appropriate vendor offering by considering the deployment model, end-to-end capabilities, pricing, support for data sources, end-user roles and any existing vendor platforms in which you have already invested.
