Introduction
DataMeshX is an accelerator for automating data ingestion. It uses Apache Airflow for scheduling, orchestration, and monitoring.
Run the following commands in your terminal:
Our demo source will pull data from an external API, which returns the information for one specified Pokémon.
- Postgres
- Azure SQL
- MySQL
- DB2
- Sybase
The destination we are creating is a simple JSON Lines file, meaning that it will contain one JSON object per line. Each object will represent data extracted from the source (see the short sketch after the list below).
- S3
- Snowflake
- Redshift
- Oracle DB
- Postgres
- Azure SQL
- DB2
- Sybase
- Databricks
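For reference, the sketch below shows what writing such a JSON Lines file looks like in plain Python; the file name and record fields are illustrative only and are not part of DataMeshX:

```python
import json

# Illustrative records only; in practice each line holds data extracted from the source.
records = [
    {"id": 25, "name": "pikachu", "base_experience": 112},
    {"id": 1, "name": "bulbasaur", "base_experience": 64},
]

# A JSON Lines file is simply one JSON object per line.
with open("demo_destination.jsonl", "w", encoding="utf-8") as out:
    for record in records:
        out.write(json.dumps(record) + "\n")
```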
Set Up Connection
The steps for adding new connections and source databases for ingestion are as follows:
- Add new Source Database Properties
- Set Values to Source Database Properties
- Run Database Schema Crawler
- Add New Ingestion Based on User Requirement
- Scheduling Custom Notebooks
Add new Source Database Properties
On the left pane of the DataMeshX Web Application, under ‘DataMeshX Configuration’, you can see a link called ‘Source Types’.
Clicking on it will direct you to a page where you can define the properties of a specific database type.
These properties are specific to each database type, e.g. Postgres connection properties will differ from Snowflake connection properties.
As shown in the example above, the properties of the Azure SQL Database and Postgres connections differ and need to be properly defined during this step.
Note: Instance Name is not required for any database; whether to set it depends on the user’s requirements for the target folder structure.
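As a purely illustrative sketch (the property names below are hypothetical, not the exact names used by DataMeshX), the set of properties defined on the ‘Source Types’ page differs per database type:

```python
# Hypothetical property definitions per source type; the real names are whatever
# is defined on the 'Source Types' page.
source_type_properties = {
    "Postgres": ["host", "port", "database", "user", "password", "schema"],
    "Azure SQL": ["server", "port", "database", "user", "password", "instance_name"],
}

for source_type, properties in source_type_properties.items():
    print(source_type, "->", ", ".join(properties))
```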
Set Values to Source Database Properties
You can set the connection property values on the page called ‘Source Database’. Based on the properties you defined in the previous step, you specify the actual values for those properties in this step.
Connection Configuration for the Hercules Prod Database:
Connection details can be defined as shown above after choosing your specific source database type. Once the required details are added, the connection can be tested by clicking on the ‘Test’ button at the bottom left.
Important: The default Ingestion Mode needs to be defined.
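Conceptually, clicking ‘Test’ amounts to opening a short-lived connection with the values you entered and running a trivial query. The sketch below assumes a Postgres source and the SQLAlchemy library; DataMeshX’s actual implementation may differ:

```python
from sqlalchemy import create_engine, text

def test_connection(host, port, database, user, password):
    """Return True if a trivial query succeeds against the source database."""
    # Assumes a Postgres source; other source types would use a different URL.
    url = f"postgresql://{user}:{password}@{host}:{port}/{database}"
    engine = create_engine(url)
    try:
        with engine.connect() as conn:
            conn.execute(text("SELECT 1"))
        return True
    except Exception:
        return False
```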
Run Database Schema Crawler
The database crawler can be accessed from the page called ‘Run Database Crawler’ on the left pane. There, you can select your database from the drop-down option and click on ‘Run Crawler’.
Upon clicking, the user will be prompted to select one of the schemas from the source database whose metadata needs to be retrieved and stored in DataMeshX’s database.
Depending on how many tables and columns are present in the schema, crawling it will take some time; a popup will appear when the crawler has finished retrieving all the required information.
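In essence, crawling a schema means reading table and column metadata and storing it in DataMeshX’s own database. The query below is only a sketch of that idea for a Postgres source, using the standard information_schema views; the function name is hypothetical:

```python
from sqlalchemy import create_engine, text

def crawl_schema(connection_url, schema_name):
    """Fetch (table, column, data type) metadata for one schema via information_schema."""
    engine = create_engine(connection_url)
    query = text(
        """
        SELECT table_name, column_name, data_type
        FROM information_schema.columns
        WHERE table_schema = :schema
        ORDER BY table_name, ordinal_position
        """
    )
    with engine.connect() as conn:
        return conn.execute(query, {"schema": schema_name}).fetchall()
```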
If the schema has already been crawled before, the user will be able to compare and bring in new and updated metadata from the schema in the source database, similar to the screen below:
Add New Ingestion Based on User Requirement
Clicking on the ‘Ingestions’ page on the left pane will show a screen similar to the figure below. Here users can see all the ingestions currently defined, their ingestion interval, the last time they were ingested into the data platform, and their ingestion type.
By clicking on a specific record, the user can also view the logs and audit tracker information of the selected source table.
To create a new ingestion, the user has to click on the Create button, which shows the following information:
As soon as the new ingestion is created, it will be picked up by the scheduler and ingested into the data platform at the next scheduled interval (based on the frequency defined by the user).
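Because DataMeshX orchestrates with Apache Airflow, each ingestion effectively becomes a task in a scheduled DAG. The sketch below is illustrative only; the DAG id, table names, and ingest callable are hypothetical stand-ins for what DataMeshX generates from its own metadata:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical ingestion list; DataMeshX reads the real one from its metadata tables.
INGESTIONS = ["hercules_prod.customers", "hercules_prod.orders"]

def ingest(table_name: str) -> None:
    print(f"Ingesting {table_name} into the data platform")

with DAG(
    dag_id="datameshx_ingestions",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",  # the frequency defined by the user
    catchup=False,
) as dag:
    for table in INGESTIONS:
        PythonOperator(
            task_id=f"ingest_{table.replace('.', '_')}",
            python_callable=ingest,
            op_kwargs={"table_name": table},
        )
```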
Scheduling Custom Notebooks
Custom notebooks developed by the end users can also be scheduled by DataMeshX.
Due to the scope of the POC, end users will need to tell the DataMeshX team the schedule interval for these notebooks and whether the notebooks/code depend on the outcome of other scheduled tasks, as these activities are not automated in this scope.
There are two tables in the datameshx database, ‘batch_job_configuration’ and ‘job_configuration’, which together store the metadata describing what needs to be scheduled by DataMeshX.
Structure and Example of ‘batch_job_configuration’
‘batch_job_configuration’ defines the properties of each task in a specific batch, e.g. the task/job name, its dependencies, etc.
‘job_configuration’ defines task-specific properties, i.e. which notebook to run, and the source and target tables for that task.
The job_name defined in the ‘batch_job_configuration’ table must exist in ‘job_configuration’ and should not have spaces or other special characters; these must be replaced by ‘_’.
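For example, a hypothetical task named ‘Daily Sales Report’ would have to be stored with a job_name such as ‘Daily_Sales_Report’. A small sketch of that naming rule:

```python
import re

def to_job_name(display_name: str) -> str:
    """Replace spaces and other special characters with underscores."""
    return re.sub(r"[^A-Za-z0-9]", "_", display_name)

print(to_job_name("Daily Sales Report"))  # Daily_Sales_Report
```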
With the table structure described above, the following tasks are automatically created with their respective dependencies in the DataMeshX scheduler:
Changing the value of the parent dependencies column in the ‘job_configuration’ table will automatically update the dependencies between the tasks.
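In Airflow terms, the parent-dependency values translate into upstream/downstream relationships between tasks. The sketch below uses hypothetical job names and placeholder operators to show the idea, assuming Airflow 2.x; the real tasks would run the configured notebooks:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Hypothetical job_name -> parent job_name(s), as it might be read from the configuration tables.
JOB_DEPENDENCIES = {
    "load_staging": [],
    "transform_sales": ["load_staging"],
    "publish_report": ["transform_sales"],
}

with DAG(
    dag_id="datameshx_custom_notebooks",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # One placeholder task per job_name.
    tasks = {name: EmptyOperator(task_id=name) for name in JOB_DEPENDENCIES}

    # Wire parent -> child relationships from the parent-dependency values.
    for job_name, parents in JOB_DEPENDENCIES.items():
        for parent in parents:
            tasks[parent] >> tasks[job_name]
```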