Note
Starting with Syniti Replicate v10.7.0.4, Syniti Replicate supports Databricks with Unity Catalog enabled as a target system.
Overview
Syniti Replicate allows you to replicate data from relational database tables to Databricks on Azure in the following replication modes:
Refresh (Snapshot Replication): A one-time complete replication from any major relational database source to Databricks as a target, according to replication settings and scripts. You can schedule the replication, identify the columns to be replicated, and add scripts to transform data during replication. Source databases include Oracle, Microsoft SQL Server, IBM Db2 for i, IBM Db2 LUW, Sybase, Informix, and MySQL.
Continuous Refresh: A regularly scheduled refresh replication as described above. The schedule is defined in the replication settings.
One-Way Mirroring (Incremental Replication): A continuous update of replicated tables based on changes to the source database that have been recorded in the database server log. Typically, this process involves an initial refresh operation, as described above, to set up the target table. Then, you can define the replication settings to check the transaction log on the source database at regular intervals. Any changes found in the log are applied to data on the Databricks platform.
Prerequisites
Connection Type: Syniti Replicate uses two types of connections:
Simba Spark ODBC (default). Download the Simba Spark driver from here.
CData ADO.NET for Databricks. Download the Databricks driver from here.
For Databricks as a target connection, you must use an Azure Data Lake Storage (ADLS) Gen2 account, in addition to the Parquet serializer.
For ADLS Gen 2 targets, the .NET provider is included with the Syniti Replicate installation package.
For Parquet Serialization, complete the following steps:
If you haven't installed NuGet, download the Windows x86 command-line version from the NuGet site and move it to a directory of your choice. For example, D:\Temp\nuget.exe.
Create a local directory to download the required NuGet packages. For example, D:\Temp\Parquet.
From a command prompt with administrator permissions, run the command below, adjusting the paths to match the directories you created above.
D:\Temp\nuget.exe install Parquet.Net -Version 4.10.1 -Framework net480 -OutputDirectory D:\Temp\Parquet
This command downloads the Parquet.Net package and its dependencies into the specified output directory.
Copy the following files from the downloaded packages into Replicate’s Plugins\ParquetSerializer directory, for example, D:\Program Files\Syniti\Syniti Replicate\Plugins\ParquetSerializer (example copy commands follow the note below).
D:\Temp\Parquet\Apache.Arrow.2.0.0\lib\netcoreapp2.1\Apache.Arrow.dll
D:\Temp\Parquet\IronCompress.1.3.0\lib\netstandard2.0\IronCompress.dll
D:\Temp\Parquet\Microsoft.Data.Analysis.0.20.1\lib\netstandard2.0\Microsoft.Data.Analysis.dll
D:\Temp\Parquet\Microsoft.IO.RecyclableMemoryStream.2.3.2\lib\netstandard2.0\Microsoft.IO.RecyclableMemoryStream.dll
D:\Temp\Parquet\Microsoft.ML.DataView.2.0.1\lib\netstandard2.0\Microsoft.ML.DataView.dll
D:\Temp\Parquet\Parquet.Net.4.10.1\lib\netstandard2.0\Parquet.dll
D:\Temp\Parquet\Snappier.1.0.0\lib\netstandard2.0\Snappier.dll
D:\Temp\Parquet\System.Buffers.4.5.1\lib\netstandard2.0\System.Buffers.dll
D:\Temp\Parquet\System.Collections.Immutable.1.5.0\lib\netstandard2.0\System.Collections.Immutable.dll
D:\Temp\Parquet\System.Memory.4.5.5\lib\netstandard2.0\System.Memory.dll
D:\Temp\Parquet\System.Numerics.Vectors.4.5.0\lib\netstandard2.0\System.Numerics.Vectors.dll
D:\Temp\Parquet\System.Runtime.CompilerServices.Unsafe.6.0.0\lib\netstandard2.0\System.Runtime.CompilerServices.Unsafe.dll
D:\Temp\Parquet\System.Threading.Tasks.Extensions.4.5.4\lib\netstandard2.0\System.Threading.Tasks.Extensions.dll
Note
It is possible that the downloaded packages are a higher version than those listed above. For example, instead of System.Threading.Tasks.Extensions.4.5.4, the package might be System.Threading.Tasks.Extensions.4.5.5. In this case, copy the version that was downloaded.
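If you prefer to script the copy step, the following is a minimal sketch from the same command prompt, assuming the example paths used above; adjust the version folders to whatever NuGet actually downloaded.
copy "D:\Temp\Parquet\Apache.Arrow.2.0.0\lib\netcoreapp2.1\Apache.Arrow.dll" "D:\Program Files\Syniti\Syniti Replicate\Plugins\ParquetSerializer\"
copy "D:\Temp\Parquet\Parquet.Net.4.10.1\lib\netstandard2.0\Parquet.dll" "D:\Program Files\Syniti\Syniti Replicate\Plugins\ParquetSerializer\"
copy "D:\Temp\Parquet\Snappier.1.0.0\lib\netstandard2.0\Snappier.dll" "D:\Program Files\Syniti\Syniti Replicate\Plugins\ParquetSerializer\"
Repeat the copy command for each of the remaining DLLs in the list above.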
If you need additional information, refer to the setup notes for ADLS Gen 2.
Configure Databricks in Syniti Replicate
Add Target Connection Wizard
Select Provider
On the Select Provider page, enter a Target name.
Select Databricks from the Database list.
By default, the selected provider is Simba Spark ODBC. See How to Configure a Simba Spark ODBC Provider below for instructions on installing and configuring the driver for use with Syniti Replicate.
Alternatively, you can use the CData ADO.NET provider for Databricks. This option is not displayed in a default installation. To show it, open the Management Center, on the main menu click TOOLS, then Options, and clear the “Use Only Default Providers” check box. Then create the target connection again; the CData option is now available in the Provider list.
Note
For .NET Data Provider connections, no value is required in the Assembly field.
Set Connection String
On the Set Connection String page, enter the following details in the corresponding fields of the Connection Properties section (an illustrative connection string follows the list):
Server: The host name or IP address of the server hosting the Databricks database.
User: The user name used to authenticate with Databricks. When using a personal access token, enter token in the User field.
Token: The token used to access the Databricks server. This token is generated by clicking the Create a personal access token link in the connection details of the SQL warehouse running on the Databricks instance.
HTTPPath: The HTTP Path of the SQL warehouse, available in the same connection details.
QueryPassThrough: This field must be set to True to use the BulkInsert option with a staging connection.
Auth Scheme: Syniti Replicate has been tested with the PersonalAccessToken authentication type, using a token generated on the Databricks instance.
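As an illustration only, a connection string built from these fields might look like the following; the exact property names and format are produced by the wizard, and the placeholder values must be replaced with the details of your own SQL warehouse.
Server=<your-workspace>.azuredatabricks.net;User=token;Token=<personal-access-token>;HTTPPath=/sql/1.0/warehouses/<warehouse-id>;QueryPassThrough=True;AuthScheme=PersonalAccessToken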
Note
Complete the rest of the fields in the Add Target Connection wizard. Refer to Add Target Connection Wizard for more information.
Connection Properties Dialog Box
The default insert mode for replicated data has been set as follows:
Refresh Replications: BulkInsert
Mirroring Replications: BulkInsert
To change these settings, and to complete the setup for Databricks, you need to edit the Connection Properties dialog box for the target connection created using the above Add Target Connection Wizard.
In Metadata Explorer, select and right-click the Databricks target connection you have created, and then click Connection Properties.
Set Staging Connection to Azure Data Lake Gen2, if using the BulkInsert mode for Refresh or Mirroring.
When using BulkInsert, the staging connection should have the following properties set:
Add Transactional Info: None
Serialization: Parquet
Access the Refresh Options section to change the Bulk Insert settings, as required.
Click OK to save your changes.
Enabling Unity Catalog
Important!
Unity Catalog must be enabled on the Databricks workspace for Syniti Replicate’s Databricks integration to function properly.
The user whose PersonalAccessToken is used to connect to the Databricks instance must be able to run SELECT statements on the CATALOGS, SCHEMATA, TABLES, COLUMNS, and KEY_COLUMN_USAGE tables in the INFORMATION_SCHEMA schema of the SYSTEM catalog. The user must also be able to CREATE and DROP tables in the catalog and schema to which Syniti Replicate moves data, and to run SELECT, INSERT, UPDATE, DELETE, and TRUNCATE statements on those tables.
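As a sketch only, the data-side privileges described above could be granted with Databricks SQL similar to the following. The catalog (replicate_catalog), schema (replicate_schema), and principal (replicate_user@example.com) are placeholders, and in Unity Catalog the INSERT, UPDATE, DELETE, and TRUNCATE operations are covered by the MODIFY privilege; access to the SYSTEM catalog's INFORMATION_SCHEMA is not shown here.
GRANT USE CATALOG ON CATALOG replicate_catalog TO `replicate_user@example.com`;
GRANT USE SCHEMA, CREATE TABLE ON SCHEMA replicate_catalog.replicate_schema TO `replicate_user@example.com`;
GRANT SELECT, MODIFY ON SCHEMA replicate_catalog.replicate_schema TO `replicate_user@example.com`;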
An ADLS storage account (for example, synitidatabricksunity) and an Access Connector for Azure Databricks must be created. The Access Connector for Azure Databricks must be assigned the Storage Blob Data Contributor role on the storage account. Add an external location to the catalog using the storage account and connector you created.
Note
Configure Unity Catalog with a managed identity.
Add an external location for Unity Catalog using a credential of the Managed Identity Credential Type.
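This step is typically done in the Databricks Catalog Explorer. As a sketch only, the equivalent Databricks SQL would look something like the following, assuming a storage credential named replicate_mi_credential (of the Managed Identity type, backed by the Access Connector), the replicate container, and the synitidatabricksunity storage account; adjust all names to your environment.
CREATE EXTERNAL LOCATION IF NOT EXISTS replicate_staging
URL 'abfss://replicate@synitidatabricksunity.dfs.core.windows.net/'
WITH (STORAGE CREDENTIAL replicate_mi_credential);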
The files uploaded by Syniti Replicate for bulk refresh and mirroring must reside in the external location (the replicate container in this example). This location is configured in Replicate through the staging connection.
An example of staging connection properties.
How to Configure a Simba Spark ODBC Provider
Download the Simba Spark driver installer from here.
Install the driver: run the .exe installer and follow the instructions displayed in the installer wizard.
Open ODBC Data Source Administrator and follow these steps:
Press Win + R, type odbcad32.exe, and press Enter.
Select the 32-bit or 64-bit ODBC Data Source Administrator based on your application's architecture.
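If you need to launch a specific bitness directly, the two administrators are at fixed paths (on 64-bit Windows, Win + R opens the 64-bit version):
C:\Windows\System32\odbcad32.exe (64-bit ODBC Data Source Administrator)
C:\Windows\SysWOW64\odbcad32.exe (32-bit ODBC Data Source Administrator)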
Add a Data Source:
Access the System DSN or User DSN tab (System DSN is recommended for shared connections).
Click Add and select Simba Spark ODBC Driver from the list.
Click Configure and complete the setup using the following information:
Data Source Name (DSN): Provide a name for your data source.
Host: Enter the Spark server hostname.
Port: Specify the port number (default is usually 443).
Authentication Type: Select the appropriate method (e.g., None, Kerberos, LDAP).
For LDAP, provide a username and password.
Transport Mode: Choose HTTP or Binary (HTTP is common for cloud-based Spark setups like Databricks).
HTTP Path: Provide the HTTP path for your Spark server (e.g., /sql/protocolv1/o/<organizationID>/<clusterID> for Databricks).
SSL Options: Enable SSL if required and provide the CA certificate if necessary.
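For reference, a DSN pointed at an Azure Databricks cluster might look like the following; all values are illustrative and should be taken from your workspace's connection details. Token-based authentication is not shown: with Databricks personal access tokens the Simba driver is commonly configured with user name and password authentication (user name token, password set to the token value), but verify this against the driver documentation for your version.
Data Source Name: Databricks_Replicate_Target
Host: <your-workspace>.azuredatabricks.net
Port: 443
Transport Mode: HTTP
HTTP Path: /sql/protocolv1/o/<organizationID>/<clusterID>
SSL Options: SSL enabled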
Click Test to verify the configuration. Troubleshoot any errors based on the provided error message.