Streamline Your Data Pipeline with ADDP

  1. Blog
  2. Pipeline
  3. Box-SandBox and Development
  4. Transformation and transmission of digital data - TTDD
  5. Data Quality
  6. Data visualization - Visualize
  7. Security and governance
  8. Integration and extensibility
  9. Deployment options
  10. Support and training

Blog

The ADDP blog is a powerful platform for employees to share their knowledge and expertise regarding the platform. With the ability to write and publish instructional content and engage in discussions, the blog serves as an essential resource for staying up to date with the latest developments and best practices.

The Power of ADDP Blog: Sharing Knowledge and Best Practices

All posts are subject to approval from an administrator and are open for comments, allowing for valuable feedback and collaboration among users.

Users can easily access their own posts via their profile and search for posts across the entire platform. Moreover, user profiles offer additional valuable information, such as uploaded files, followed users, and followers, fostering a sense of community and connection.

Pipeline

The ADDP pipeline connects to a variety of data sources, including:

  • Databases:
    • PostgreSQL, MySQL, Oracle, Microsoft SQL Server, MongoDB, Redis, IBM DB2, MariaDB, Elasticsearch, Cassandra, SQLite, OrientDB, DynamoDB, Neo4j, and Firebird SQL.
  • Legacy systems:
    • SAP, IBM Cognos, SAS, Informatica, and Teradata.
  • SaaS applications:
    • Google Analytics, Microsoft Power BI, Salesforce Analytics, Tableau, Amazon Web Services (AWS) Analytics, and IBM Watson Analytics.
  • Web services:
    • Google BigQuery, Apache Hadoop, Apache Spark, RStudio Server, and the KNIME analytical platform.
  • Files in formats such as:
    • text (TXT), comma-separated values (CSV), tab-separated values (TSV), JavaScript Object Notation (JSON), Hierarchical Data Format version 5 (HDF5 or H5), Tagged Image File Format (TIFF), Waveform Audio File Format (WAV), MPEG-1 Audio Layer 3 (MP3), Advanced Audio Coding (AAC), MPEG-4 Part 14 (MP4), Audio Video Interleave (AVI), and QuickTime File Format (MOV).

After defining and configuring the data source, users can select the objects (tables) they want to collect data from and specify the output path in the raw layer where the data should go. Finally, users can schedule how frequently the data should be ingested.
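
For illustration, here is a minimal sketch of what such an ingest definition could look like if written down in Python. The keys (source, objects, output_path, schedule) and all values are assumptions made for this example, not ADDP's actual configuration format.

    # Hypothetical ingest definition; keys and values are illustrative only.
    ingest_config = {
        "source": {
            "type": "postgresql",
            "host": "db.example.com",
            "port": 5432,
            "database": "sales",
            "user": "addp_reader",        # credentials would normally come from a secret store
        },
        "objects": ["customers", "orders", "order_items"],   # tables to collect data from
        "output_path": "raw/sales/",                         # destination in the raw layer
        "schedule": "0 2 * * *",                             # ingest frequency (daily at 02:00, cron syntax)
    }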

In order to extract data from databases, the ADDP pipeline follows a set of rules.

Database sources

For example, to extract data from MySQL, the pipeline first connects to the MySQL server using the appropriate credentials. Then, it queries the MySQL server to extract the data from the selected tables, using SQL statements.

Finally, it writes the extracted data to the specified output path. Similar rules are followed for other databases and data sources supported by the ADDP pipeline.
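
As a rough illustration of those steps, the following Python sketch connects to a MySQL server with the mysql-connector-python package, queries a couple of selected tables with SQL statements, and writes the results to an output path. Host names, credentials, table names, and paths are placeholders; ADDP performs these steps through its own pipeline engine.

    import csv
    import mysql.connector

    # connect to the MySQL server using the appropriate credentials (placeholders here)
    conn = mysql.connector.connect(
        host="mysql.example.com", user="addp_reader", password="***", database="sales"
    )
    cursor = conn.cursor()

    for table in ["customers", "orders"]:                # the selected tables
        cursor.execute(f"SELECT * FROM {table}")         # extract the data with SQL statements
        rows = cursor.fetchall()
        columns = [col[0] for col in cursor.description]

        # write the extracted data to the specified output path in the raw layer
        with open(f"raw/sales/{table}.csv", "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(columns)
            writer.writerows(rows)

    cursor.close()
    conn.close()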

ADDP Pipeline

Box-SandBox and Development

The Box-SandBox and Development features of ADDP provide users with a powerful set of tools to work with data in various ways.

ADDP SandBox

Box is an advanced platform built on top of the JupyterLab environment, providing technical users with the ability to write their own Python scripts for a wide range of data-related tasks.

  • This includes:
    • exploratory data analysis,
    • data cleaning and transformation,
    • statistical modeling,
    • data visualization,
    • machine learning, and more.
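
The snippet below is a small, self-contained example of the kind of script a Box notebook might contain: loading a raw CSV with pandas, cleaning it, and deriving a simple summary. The file path and column names are hypothetical.

    import pandas as pd

    df = pd.read_csv("orders.csv")

    # data cleaning: drop duplicates and fill missing amounts
    df = df.drop_duplicates()
    df["amount"] = df["amount"].fillna(0)

    # exploratory analysis: monthly revenue per customer
    df["order_date"] = pd.to_datetime(df["order_date"])
    summary = (
        df.groupby([df["order_date"].dt.to_period("M"), "customer_id"])["amount"]
          .sum()
          .reset_index(name="monthly_revenue")
    )
    print(summary.head())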

Benefits of ADDP SandBox

The resulting data can then be stored in the data lake service layer and accessed from the Visualize feature, where it can be viewed in a more user-friendly, visual format.

In the Development section, users can write queries for the database and use a generator to transform unstructured files into a format that can be accessed with SQL. This feature is particularly useful for dealing with files that may not conform to traditional data formats. It allows users to extract meaningful data and incorporate it into their analysis pipeline.
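
Conceptually, the generator's role can be pictured like the following Python sketch, which flattens a semi-structured JSON-lines file into a table and exposes it to SQL via SQLite. The file name and columns are made up, and ADDP's generator is its own internal component rather than this code.

    import json
    import sqlite3
    import pandas as pd

    # flatten the semi-structured file into a tabular form
    with open("events.jsonl") as f:
        records = [json.loads(line) for line in f]
    df = pd.json_normalize(records)

    # expose it to SQL so it can be queried like any other table
    conn = sqlite3.connect(":memory:")
    df.to_sql("events", conn, index=False)

    for row in conn.execute("SELECT event_type, COUNT(*) FROM events GROUP BY event_type"):
        print(row)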

Files in Box are made public only if the file's owner allows it.

Sources

Additionally, users can easily identify the owner of a file and reach out to them with questions or comments. With these powerful features, ADDP enables users to explore, manipulate, and transform their data in a flexible and intuitive way.

Transformation and transmission of digital data - TTDD

The Transformation and Transmission of Digital Data (TTDD) module is a critical component of any data management system, and its Jobs feature is a powerful tool that allows users to create complex data transformation workflows. This section highlights the many benefits and advantages of using Jobs within TTDD.

Data Transfer Job

  • First, Jobs come with several templates, including:
    • backup and restoration options,
    • executing SQL queries, and
    • backing up SQL results and posting them to an API server.
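
As a hedged illustration of the third template, the sketch below runs a SQL query, backs up the result, and posts it to an API server. The function, database, file names, and endpoint are assumptions made for this example and are not ADDP's actual job format.

    import json
    import sqlite3
    import requests

    def run_sql_and_post(db_path, query, backup_path, api_url):
        # execute the SQL query
        conn = sqlite3.connect(db_path)
        cur = conn.execute(query)
        columns = [c[0] for c in cur.description]
        rows = [dict(zip(columns, r)) for r in cur.fetchall()]
        conn.close()

        # back up the SQL results locally
        with open(backup_path, "w") as f:
            json.dump(rows, f)

        # post the results to the API server
        requests.post(api_url, json=rows, timeout=30)

    run_sql_and_post(
        "sales.db",
        "SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id",
        "order_totals_backup.json",
        "https://api.example.com/ingest",   # placeholder endpoint
    )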

The master job is designed to create more complex transformations by combining multiple jobs.

All jobs within a master job can be run in parallel.

Master Jobs

Jobs also have the ability to specify the number of parallel executions, which can significantly speed up the transformation and data transfer processes, allowing users to process large amounts of data quickly.
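
The following minimal Python sketch mirrors the idea of a configurable number of parallel executions using a thread pool; the jobs themselves are stand-ins, not ADDP internals.

    from concurrent.futures import ThreadPoolExecutor

    def transfer_table(table: str) -> str:
        # placeholder for a real transfer job
        return f"{table} transferred"

    tables = ["customers", "orders", "order_items", "invoices"]

    # max_workers plays the role of the "number of parallel executions" setting
    with ThreadPoolExecutor(max_workers=2) as pool:
        for result in pool.map(transfer_table, tables):
            print(result)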

If a job fails during execution, the next startup will continue where it left off to ensure maximum efficiency.

The most important feature of Jobs, however, is the continue-where-it-left-off functionality, which ensures maximum efficiency and minimizes errors. If a job fails during execution, the system stops the process and notifies the user of the error, preventing further errors and making sure the user is aware of the problem. On the next run, the system resumes the process where it left off rather than starting from the beginning, which saves time and effort because steps that have already completed are not repeated. This option is flexible: users can turn it on or off depending on their preferences. Moreover, even if some data is missing, the system will still continue with the remaining tasks. This enables users to take on riskier tasks and make decisions with confidence, knowing they have the support of ADDP's resilient job execution system.
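
A simplified way to picture this behaviour is a checkpoint file that records completed steps so a restarted run can skip them, as in the sketch below. ADDP's own mechanism is internal to the platform; this code only mirrors the behaviour described above.

    import json
    import os

    CHECKPOINT = "job_checkpoint.json"   # records steps finished in earlier runs

    def load_done():
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT) as f:
                return set(json.load(f))
        return set()

    def mark_done(done, step):
        done.add(step)
        with open(CHECKPOINT, "w") as f:
            json.dump(sorted(done), f)

    steps = ["extract_customers", "extract_orders", "load_warehouse", "notify_api"]
    done = load_done()

    for step in steps:
        if step in done:
            continue               # already completed in a previous run, skip it
        print(f"running {step}")   # the real work for the step would go here
        mark_done(done, step)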

The ability to enable or disable jobs.

Another useful functionality of the TTDD module is the ability to enable or disable jobs. This feature is particularly beneficial when working with the master job, which involves a complex process of data transformation and transfer. Users can disable a particular job temporarily and enable it again later when needed, without having to create a new process or delete individual jobs from the workflow. This way, users can save time and streamline the data transformation and transmission process by managing jobs more efficiently.

Finally, Jobs also offer the ability to configure job settings to prioritize certain tasks. For example, users can specify that data be sent to a particular API server before other tasks continue. This feature allows users to optimize their data processing workflows and ensure that critical tasks are completed first.

Overall, Jobs are a powerful and flexible tool within the TTDD module that enables users to manage complex data transformation workflows efficiently and effectively. The continue where it left off feature, the ability to enable/disable jobs, and the capability to prioritize certain tasks are just a few of the many benefits that make Jobs an essential component of any data management system.

Data Quality

The Data Quality module is a crucial component of the Advanced Data Development Platform (ADDP) that enables rules to be predefined and applied as data quality checks during data transmission, helping to guarantee the accuracy and reliability of the transmitted data.

ADDP Data Quality Dimensions

Each of the templates used in the jobs can include rules that are applied during data transmission, and in the case of more complex processes such as the master job, each job can have its own set of rules. When the data is transmitted, it is checked against these rules, and if any data violates them, the system raises an alert.

Notifications are also an integral part of the rule set and inform all users involved, so everyone stays up to date with the results of the data quality checks.

Data Quality Rule Engine

For example, in a banking scenario, the Data Quality module could be used to verify whether clients have monthly payments higher than 5000€ or if all clients have paid their low-risk category loan obligations by the 1st of the month. In the medical field, if data is transmitted every couple of minutes, the Data Quality module would help us check whether patients with certain diagnoses and parameters have high blood pressure or whether all patients with severe illnesses have been attended to this month, among other things.
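
To make the banking example concrete, the following hedged pandas sketch flags clients whose monthly payments exceed 5000€ and low-risk clients who have not settled their loan obligations by the 1st of the month. The column names and sample values are invented for illustration.

    import pandas as pd

    # invented sample data standing in for transmitted banking records
    clients = pd.DataFrame({
        "client_id":        [1, 2, 3],
        "monthly_payment":  [4200.0, 6100.0, 900.0],
        "risk_category":    ["low", "medium", "low"],
        "loan_paid_by_1st": [True, True, False],
    })

    # rule 1: monthly payments higher than 5000€
    high_payments = clients[clients["monthly_payment"] > 5000]

    # rule 2: low-risk clients who have not paid their loan obligations by the 1st
    unpaid_low_risk = clients[(clients["risk_category"] == "low") & (~clients["loan_paid_by_1st"])]

    if not high_payments.empty or not unpaid_low_risk.empty:
        print("Data quality alert:")
        print(high_payments)
        print(unpaid_low_risk)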

This module can be applied across industries, including sales, insurance, and other areas, where data plays a critical role. By defining rules for data quality checks, the Data Quality module ensures that data is accurate and reliable, allowing businesses to make better-informed decisions and optimize their operations.

The implementation of data quality rules across various data types can yield benefits in multiple domains, including banking, medicine, sales, insurance, and others. By ensuring the accuracy and reliability of data through the Data Quality module, decision-making and process optimization can be improved. Moreover, the module’s ability to monitor and track all processes, including their history based on data and rules, provides a visual representation of the state of all processes, making it an integral part of the Advanced Data Development Platform (ADDP).

Data visualization - Visualize

Visualize is an ADDP feature that enables users to easily visualize their data. It is integrated with Box&Development, where users can store data in the data lake service layer and call it from Visualize. Visualize provides a variety of visualization types such as bar charts, line charts, scatter plots, heat maps, and more, allowing users to select the most appropriate type for their data. Users can also customize their visualizations with various options for color, labels, and legends.

Data Visualization

In addition to the standard visualizations, Visualize also supports interactive dashboards, where users can create multiple visualizations on one page, and add filters and other controls to enable dynamic interactions with the data. These dashboards can be shared with other users, either within the organization or externally, and can be embedded in other applications or websites.

Visualize also provides tools for data exploration and analysis, such as the ability to pivot and aggregate data, and to create calculated fields based on existing data. Users can also create alerts and notifications based on their data, so they can be notified when certain conditions are met.
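
For readers unfamiliar with the terms, the short pandas example below shows what a pivot, an aggregation, and a calculated field look like conceptually; in ADDP these operations are performed through the Visualize interface rather than in code, and the sample data is invented.

    import pandas as pd

    sales = pd.DataFrame({
        "region":  ["EU", "EU", "US", "US"],
        "quarter": ["Q1", "Q2", "Q1", "Q2"],
        "revenue": [120, 150, 200, 180],
        "cost":    [80, 90, 140, 130],
    })

    # calculated field derived from existing columns
    sales["margin"] = sales["revenue"] - sales["cost"]

    # pivot and aggregate: total revenue per region and quarter
    pivot = sales.pivot_table(index="region", columns="quarter", values="revenue", aggfunc="sum")
    print(pivot)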

Security and governance

ADDP includes several features for security and governance, to ensure that data is protected and used appropriately. These features include:

  • Access control: ADDP provides granular access control, allowing administrators to define roles and permissions for different users and groups, and to control access to specific data sources, pipelines, and jobs.
  • Encryption: ADDP supports encryption of data in transit and at rest, using industry-standard encryption algorithms.
  • Auditing: ADDP logs all user activity, including access to data sources, pipelines, and jobs, as well as changes to system settings and configurations.
  • Compliance: ADDP is designed to comply with industry regulations such as GDPR and HIPAA, and provides tools to help organizations meet their compliance requirements.
  • Data lineage: ADDP tracks the lineage of data, from its source to its final destination, providing transparency and traceability for data governance and compliance.
  • Integration with enterprise systems: ADDP can integrate with enterprise systems such as Active Directory, LDAP, and SAML, to provide seamless authentication and authorization.
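
As a simplified sketch of what granular access control means in practice, the example below maps roles to permissions on specific resources such as data sources, pipelines, and jobs. The role names and checking logic are illustrative assumptions, not ADDP's actual security model.

    # roles grant permissions on specific resources (data sources, pipelines, jobs)
    ROLES = {
        "analyst":  {"pipeline:sales": {"read"}, "datasource:crm": {"read"}},
        "engineer": {"pipeline:sales": {"read", "write"}, "job:nightly_load": {"read", "write", "run"}},
    }

    def is_allowed(role: str, resource: str, action: str) -> bool:
        return action in ROLES.get(role, {}).get(resource, set())

    print(is_allowed("analyst", "pipeline:sales", "write"))   # False
    print(is_allowed("engineer", "job:nightly_load", "run"))  # True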

Data Security

Integration and extensibility

ADDP is designed to be highly extensible and customizable, with a variety of integration points and APIs. This allows organizations to integrate ADDP with their existing systems and applications, and to extend its functionality to meet their specific needs.

ADDP API

ADDP provides APIs for accessing data sources, creating and managing pipelines and jobs, and querying data. It also supports integration with third-party systems and applications, such as BI tools, data science platforms, and cloud services.
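
A hedged example of what calling such an API over HTTP might look like is shown below; the base URL, endpoints, and payload fields are hypothetical, so the real ADDP API reference should be consulted for the actual ones.

    import requests

    BASE_URL = "https://addp.example.com/api/v1"      # placeholder base URL
    HEADERS = {"Authorization": "Bearer <token>"}     # placeholder credentials

    # list existing pipelines (hypothetical endpoint)
    pipelines = requests.get(f"{BASE_URL}/pipelines", headers=HEADERS, timeout=30).json()
    print(pipelines)

    # trigger a job run with a priority hint (hypothetical endpoint and payload)
    run = requests.post(
        f"{BASE_URL}/jobs/nightly_load/runs",
        headers=HEADERS,
        json={"priority": "high"},
        timeout=30,
    )
    print(run.status_code)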

ADDP also supports custom connectors, allowing organizations to create their own connectors for data sources that are not already supported. This makes it easy to integrate with legacy systems or specialized data sources.
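
Conceptually, a custom connector can be pictured as a small class with a read method that the platform invokes, as in the sketch below. The interface shown is an assumption for illustration, not ADDP's actual connector SDK.

    import csv
    from typing import Iterator

    class LegacyCsvConnector:
        """Hypothetical connector that reads records from a legacy system's CSV export."""

        def __init__(self, export_path: str):
            self.export_path = export_path

        def read(self) -> Iterator[dict]:
            # stream records one by one, as a pipeline would consume them
            with open(self.export_path, newline="") as f:
                yield from csv.DictReader(f)

    for record in LegacyCsvConnector("legacy_orders.csv").read():
        print(record)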

Deployment options

ADDP can be deployed in a variety of environments, including on-premises, in the cloud, or in a hybrid environment. It supports containerization with Docker and Kubernetes, making it easy to deploy and manage in a containerized environment. ADDP also supports high availability and scalability, with built-in clustering and load balancing features.

Support and training

ADDP includes comprehensive documentation and support resources, including user guides, tutorials, and a knowledge base. It also provides professional services and training, to help organizations get up and running with ADDP quickly and efficiently. Organizations can also access a community of users and experts, to share best practices, ask questions, and get support.

In summary, the Advanced Data Development Platform (ADDP) is a distributed data pipeline solution that provides a variety of features for data transformation and transmission, data visualization, security and governance, integration and extensibility, and deployment options. Its modular architecture and compatibility with a wide range of data sources, databases, SaaS applications, and web services make it a versatile and powerful tool for managing data pipelines.

The platform’s ability to handle both batch and real-time data processing and its support for big data technologies like Apache Hadoop and Apache Spark also make it a great choice for organizations with diverse data needs. With its user-friendly interface, robust job management tools, and comprehensive support resources, ADDP is an excellent choice for organizations looking to streamline their data management processes and gain valuable insights from their data.

Knowledge Sharing Pipeline: A Visual Guide to Support and Training
