Microsoft Azure Data Lake

Note
In response to Microsoft’s retirement of Azure Data Lake Storage (ADLS) Gen 1 on February 29, 2024, Syniti Replicate discontinues support for using ADLS Gen 1 for replication services. Refer to the retirement announcement here.

If you are using ADLS Gen 1, Syniti recommends to use ADLS Gen 2.

Syniti Replicate supports the use of Microsoft ADLS Gen 2 for:

Refresh replications, target only.
Mirroring replications, target only.

Microsoft ADLS Gen 2

This topic describes how to set up your Syniti Replicate environment to replicate data to a Microsoft Azure Data Lake Storage (ADLS) Gen2 as target connection. The replicated data is stored in flat file format in ADLS Gen2. When replicating from a relational database to ADLS Gen2, Syniti Replicate allows you to replicate data in two different modes:

Refresh: A one-time complete replication from source table to ADLS Gen2, according to replication settings and scripts. You can schedule the replication, identify the columns to be replicated, and add scripts to transform data during replication.
One-Way Mirroring: A continuous update of a replicated table to ADLS Gen2 based on changes to the source database that have been recorded in the database server log. You can define the replication settings to check the transaction log on the source database at regular intervals. Any changes found in the log would be applied to the target.

Prerequisites

For ADLS Gen2 targets, the .NET provider and plugin are included with the Syniti Replicate installation package.
Ensure that you create an application ID in Azure Active Directory, this application ID must have write access to the specified ADLS Gen2 target container.
In the Access Control (IAM) settings for the ADLS Gen2 file system, assign the Storage Blob Data Contributor permissions to the role used in Syniti Replicate.
Make sure you have database connections via a .NET data provider to the supported databases for source database tables.
For Parquet Serialization, complete the following steps:
1. Download the Parquet .Net v4.10.1 provider from the Nuget site.
2. Create a local directory to deploy the required files (for example, C:\tmp\adls) and save those files in it.
3. From the command prompt with admin permissions, execute the below command. Make sure to execute this command from the right directory.
```
nuget install Parquet.Net -Version 4.10.1 -Framework net480 -OutputDirectory C:\tmp\adls
```
  This command will download the entire provider into the specified output directory.
4. Copy the below files from the downloaded packages and paste it into Replicate’s Plugins\ParquetSerializer directory. For example, C:\Program Files\Syniti\Syniti Replicate\Plugins\ParquetSerializer.
  - D:\Temp\Parquet\Apache.Arrow.2.0.0\lib\netcoreapp1.3\Apache.Arrow.dll
  - D:\Temp\Parquet\IronCompress.1.3.0\lib\netstandard2.0\IronCompress.dll
  - D:\Temp\Parquet\Microsoft.Data.Analysis.0.20.1\lib\netstandard2.0\Microsoft.Data.Analysis.dll
  - D:\Temp\Parquet\Microsoft.IO.RecyclableMemoryStream.2.3.2\lib\netstandard2.0\Microsoft.IO.RecyclableMemoryStream.dll
  - D:\Temp\Parquet\Microsoft.ML.DataView.2.0.1\lib\netstandard2.0\Microsoft.ML.DataView.dll
  - D:\Temp\Parquet\Parquet.Net.4.10.1\lib\netstandard2.0\Parquet.dll
  - D:\Temp\Parquet\Snappier.1.0.0\lib\netstandard2.0\Snappier.dll
  - D:\Temp\Parquet\System.Buffers.4.5.1\lib\netstandard2.0\System.Buffers.dll
  - D:\Temp\Parquet\System.Collections.Immutable.1.5.0\lib\netstandard2.0\System.Collections.Immutable.dll
  - D:\Temp\Parquet\System.Memory.4.5.5\lib\netstandard2.0\System.Memory.dll
  - D:\Temp\Parquet\System.Numerics.Vectors.4.5.0\lib\netstandard2.0\System.Numeric.Vectors.dll
  - D:\Temp\Parquet\System.Runtime.CompilerServices.Unsafe.6.0.0\lib\netstandard2.0\System.Runtime.CompilerServices.Unsafe.dll
  - D:\Temp\Parquet\System.Threading.Tasks.Extensions.4.5.4\lib\netstandard2.0\System.Threading.Tasks.Extensions.dll
    Note
    It is possible that the packages are a higher version than mentioned above. For example, instead of System.Threading.Tasks.Extensions.4.5.4, the package can be System.Threading.Tasks.Extensions.4.5.5. In this case, copy the version that has been downloaded.

Limitations

Azure Data Lake is supported only as target system in Refresh and Mirroring mode.
Access via Proxy is not supported. Syniti Replicate needs to connect directly with Azure Data Lake.
Data Encryption is not supported for Azure Data Lake.
Data Masking is not supported in Syniti Replicate.
Key Vault is not supported in Syniti Replicate.
Database and schema names containing special characters like slash or backslash are not supported.

Configure ADLS Gen2 in Syniti Replicate

Using ADLS Gen2, you can:

Set a replication using Refresh Mode, or
Set a replication using One-way Mirroring Mode
If your replication is Mirroring mode (Change Data Capture), you can define your Transactional Strategy.

Set Up a Target Connection to ADLS Gen2

This section assumes that you have selected the Enable Transactional Replication wizard option to launch the Add Target Connection wizard. If not, to open the wizard:

In the Management Center, select Targets in the Metadata Explorer.
Right-click and select Add New Connection. The Add Target Connection Wizard is displayed.

To set up a Target Connection to ADLS Gen2:

In the Add Target Connection Wizard, select the Microsoft Azure Data Lake Gen2 option from the Provider field.
Click Next to move to the Set connection string page.
Set the following values to the respective fields accordingly:

Field Name	Description
Output Folder	The location of the schema for config files for the ADLS Gen2 objects
Client Id	Client ID in GUID format that identifies the client application. (Application added in App registrations in the Active Directory)
Tenant Id	TenantID (Directory) in grid format representing the Active Directory Tenant.
Client Secret	Client secret token used from the client application. It can be generated by clicking Certificates & Secrets for the App added in App registrations.
Storage Account Name	The name of the ADLS Gen2 account used to load data.
Target Filesystem Name	Target FileSystem or Container name to use in loading data in ADLS Gen2.

Optional Properties

Set additional properties for the following fields accordingly:

Field Name	Description
Use Column Names	Determinates if the output CSV files generated during replication will contain the list of column names in the first row.
Add Transactional Info	Determinates if the output CSV files generated during replication will contain transactional info columns (operation type, transaction id, transaction timestamp, transaction user, and so on) and if these information will appear at the beginning of the record.
Transactional Column Prefix	Prefix to add to transactional fields (if selected). This option will allow using a custom format for the transactional fields to distinguish them from the data fields
Serialization	Define the type of serialization required for the Azure Data Lake: 0: CSV 1: Parquet
Column Separator	[CSV Only] Indicates the character used as column separator in the CSV file. \t: Tab Separator ,: Comma \|: Pipe
Column Escape	[CSV Only] Indicates the escape character used for the column separator. Default backslash (\). Editable value.
String Separator	[CSV Only] Indicates the character used as string separator in the CSV file. Default quotation marks ("). Editable value.
String Escape	[CSV Only] Indicates the escape character used for the string separator. Default backslash (\). Editable value.
Compression	[Parquet Only] The Parquet compression method used. 0-None: No Compression. 1-Gzip: Offers the highest compression ratio, it is not splitable and takes sometimes to encode and decode. 2-Snappy: Offers less compression but is splitable and quick to encode and decode.
Use Nullable Fields	[Parquet Only] Allow Nullable fields.
Use DateTime Fields	[Parquet Only] Allow DataTime fields.
Use One File Per Group	Indicates if each replication group generates a common CSV file or one file for each replication in the group.
Extended Properties	Additional properties not listed in the grid. Enter the properties as a list of "property=value" string separated by (;).

File Structure

Data files replicated to ADLS Gen2 using mirroring have the following structure. Files have a .mir extension when used for mirroring and a .ref extension when used for snapshot or refresh replications.

Column	Value
1	Sequence number
2	Timestamp of transaction
3	User, if provided by the database
4	Table name
5	Operation code – one of I (Insert), U (Update), or D (Delete)
6	Previous primary key value for the database column. This is used for update operations to identify the correct record
7	Next primary key value. This is used for update operations to identify the correct record in cases where the primary key is changed during the update operation
8	Database column value

Example File

Example filename: corp_20150115T161123155064_1.mir

10,2015-01-22 15:32:39.000000,,"glb","I",,2,"test insert"
11,2015-01-22 15:32:40.000000,," glb ","D",10,,
12,2015-01-22 15:32:42.000000,," glb ","U",104,104,"my updated"
13,2015-01-22 15:32:44.000000,," glb ","U",1,12,"sd"

Partitions for Refresh Replications

Use this dialog box to set up partitions for refresh replications to improve replication efficiency.

Refresh Partitions allow you to define multiple partitions or subsets within a replication during the refresh operation, so that multiple parallel threads move a specific subset of the data.

Syniti Replicate will create same number of files according the number of partitions defined in the source system, for example, if the user creates 10 partitions per table, Syniti Replicate will create 10 CSV or Parquet files into the directory folder in ADLS Gen2.

Data Type Mapping

When ADLS Gen2 is used as a target system, the mapping of Syniti Replicate data types to the corresponding data type in ADLS Gen2 is listed in the following table:

Syniti Replicate Mapping	Azure Data Lake
AutoDecimal	decimal
AutoInteger	integer
AutoNumeric	decimal
AutoSmallint	smallint
AutoTinyint	tinyint
Bigint	bigint
Binary	binary
Bit	bit
Blob	binary
Char	string
Clob	string
Date	date
Decimal	decimal
Double	double
Float	float
GUID	string
Integer	integer
LongVarbinary	binary
LongVarchar	string
Money	decimal
NChar	string
NText	string
Number	decimal
Numeric	decimal
NVarchar	string
Real	real
Smalldatetime	datetime
Smallint	smallint
Smallmoney	decimal
Sysname	string
Time	time
Timestamp	timestamp
TimestampZ	timestamptz
Tinyint	tinyint
Varbinary	binary
Varchar	string
Xml	string