Microsoft Azure Data Lake

Prev Next

Note

In response to Microsoft’s retirement of Azure Data Lake Storage (ADLS) Gen 1 on February 29, 2024, Syniti Replicate discontinues support for using ADLS Gen 1 for replication services. Refer to the retirement announcement here.

If you are using ADLS Gen 1, Syniti recommends to use ADLS Gen 2.

Syniti Replicate supports the use of Microsoft ADLS Gen 2 for:

Microsoft ADLS Gen 2

This topic describes how to set up your Syniti Replicate environment to replicate data to a Microsoft Azure Data Lake Storage (ADLS) Gen2 as target connection. The replicated data is stored in flat file format in ADLS Gen2. When replicating from a relational database to ADLS Gen2, Syniti Replicate allows you to replicate data in two different modes:

  • Refresh: A one-time complete replication from source table to ADLS Gen2, according to replication settings and scripts. You can schedule the replication, identify the columns to be replicated, and add scripts to transform data during replication.

  • One-Way Mirroring: A continuous update of a replicated table to ADLS Gen2 based on changes to the source database that have been recorded in the database server log. You can define the replication settings to check the transaction log on the source database at regular intervals. Any changes found in the log would be applied to the target.

Prerequisites

  • For ADLS Gen2 targets, the .NET provider and plugin are included with the Syniti Replicate installation package.

  • Ensure that you create an application ID in Azure Active Directory, this application ID must have write access to the specified ADLS Gen2 target container.

  • In the Access Control (IAM) settings for the ADLS Gen2 file system, assign the Storage Blob Data Contributor permissions to the role used in Syniti Replicate.

  • Make sure you have database connections via a .NET data provider to the supported databases for source database tables.

  • For Parquet Serialization, complete the following steps:

    1. Download the Parquet .Net v4.10.1 provider from the Nuget site.

    2. Create a local directory to deploy the required files (for example, C:\tmp\adls) and save those files in it.

    3. From the command prompt with admin permissions, execute the below command. Make sure to execute this command from the right directory.

      nuget install Parquet.Net -Version 4.10.1 -Framework net480 -OutputDirectory C:\tmp\adls

      This command will download the entire provider into the specified output directory.

    4. Copy the below files from the downloaded packages and paste it into Replicate’s Plugins\ParquetSerializer directory. For example, C:\Program Files\Syniti\Syniti Replicate\Plugins\ParquetSerializer.

      • D:\Temp\Parquet\Apache.Arrow.2.0.0\lib\netcoreapp1.3\Apache.Arrow.dll

      • D:\Temp\Parquet\IronCompress.1.3.0\lib\netstandard2.0\IronCompress.dll

      • D:\Temp\Parquet\Microsoft.Data.Analysis.0.20.1\lib\netstandard2.0\Microsoft.Data.Analysis.dll

      • D:\Temp\Parquet\Microsoft.IO.RecyclableMemoryStream.2.3.2\lib\netstandard2.0\Microsoft.IO.RecyclableMemoryStream.dll

      • D:\Temp\Parquet\Microsoft.ML.DataView.2.0.1\lib\netstandard2.0\Microsoft.ML.DataView.dll

      • D:\Temp\Parquet\Parquet.Net.4.10.1\lib\netstandard2.0\Parquet.dll

      • D:\Temp\Parquet\Snappier.1.0.0\lib\netstandard2.0\Snappier.dll

      • D:\Temp\Parquet\System.Buffers.4.5.1\lib\netstandard2.0\System.Buffers.dll

      • D:\Temp\Parquet\System.Collections.Immutable.1.5.0\lib\netstandard2.0\System.Collections.Immutable.dll

      • D:\Temp\Parquet\System.Memory.4.5.5\lib\netstandard2.0\System.Memory.dll

      • D:\Temp\Parquet\System.Numerics.Vectors.4.5.0\lib\netstandard2.0\System.Numeric.Vectors.dll

      • D:\Temp\Parquet\System.Runtime.CompilerServices.Unsafe.6.0.0\lib\netstandard2.0\System.Runtime.CompilerServices.Unsafe.dll

      • D:\Temp\Parquet\System.Threading.Tasks.Extensions.4.5.4\lib\netstandard2.0\System.Threading.Tasks.Extensions.dll

        Note

        It is possible that the packages are a higher version than mentioned above. For example, instead of System.Threading.Tasks.Extensions.4.5.4, the package can be System.Threading.Tasks.Extensions.4.5.5. In this case, copy the version that has been downloaded.

Limitations

  • Azure Data Lake is supported only as target system in Refresh and Mirroring mode.

  • Access via Proxy is not supported. Syniti Replicate needs to connect directly with Azure Data Lake.

  • Data Encryption is not supported for Azure Data Lake.

  • Data Masking is not supported in Syniti Replicate.

  • Key Vault is not supported in Syniti Replicate.

  • Database and schema names containing special characters like slash or backslash are not supported.

Configure ADLS Gen2 in Syniti Replicate

Using ADLS Gen2, you can:

Set Up a Target Connection to ADLS Gen2

This section assumes that you have selected the Enable Transactional Replication wizard option to launch the Add Target Connection wizard. If not, to open the wizard:

  1. In the Management Center, select Targets in the Metadata Explorer.

  2. Right-click and select Add New Connection. The Add Target Connection Wizard is displayed.

To set up a Target Connection to ADLS Gen2:

  1. In the Add Target Connection Wizard, select the Microsoft Azure Data Lake Gen2 option from the Provider field.

  2. Click Next to move to the Set connection string page.

  3. Set the following values to the respective fields accordingly:

Field Name

Description

Output Folder

The location of the schema for config files for the ADLS Gen2 objects

Client Id

Client ID in GUID format that identifies the client application. (Application added in App registrations in the Active Directory)

Tenant Id

TenantID (Directory) in grid format representing the Active Directory Tenant.

Client Secret

Client secret token used from the client application. It can be generated by clicking Certificates & Secrets for the App added in App registrations.

Storage Account Name

The name of the ADLS Gen2 account used to load data.

Target Filesystem Name

Target FileSystem or Container name to use in loading data in ADLS Gen2.

Optional Properties

Set additional properties for the following fields accordingly:

Field Name

Description

Use Column Names

Determinates if the output CSV files generated during replication will contain the list of column names in the first row.

Add Transactional Info

Determinates if the output CSV files generated during replication will contain transactional info columns (operation type, transaction id, transaction timestamp, transaction user, and so on) and if these information will appear at the beginning of the record.

Transactional Column Prefix

Prefix to add to transactional fields (if selected). This option will allow using a custom format for the transactional fields to distinguish them from the data fields

Serialization

Define the type of serialization required for the Azure Data Lake:

  • 0: CSV

  • 1: Parquet

Column Separator

[CSV Only] Indicates the character used as column separator in the CSV file.

  • \t: Tab Separator

  • ,: Comma

  • |: Pipe

Column Escape

[CSV Only] Indicates the escape character used for the column separator. Default backslash (\). Editable value.

String Separator

[CSV Only] Indicates the character used as string separator in the CSV file. Default quotation marks ("). Editable value.

String Escape

[CSV Only] Indicates the escape character used for the string separator. Default backslash (\). Editable value.

Compression

[Parquet Only] The Parquet compression method used.

  • 0-None: No Compression.

  • 1-Gzip: Offers the highest compression ratio, it is not splitable and takes sometimes to encode and decode.

  • 2-Snappy: Offers less compression but is splitable and quick to encode and decode.

Use Nullable Fields

[Parquet Only] Allow Nullable fields.

Use DateTime Fields

[Parquet Only] Allow DataTime fields.

Use One File Per Group

Indicates if each replication group generates a common CSV file or one file for each replication in the group.

Extended Properties

Additional properties not listed in the grid. Enter the properties as a list of "property=value" string separated by (;).

File Structure

Data files replicated to ADLS Gen2 using mirroring have the following structure. Files have a .mir extension when used for mirroring and a .ref extension when used for snapshot or refresh replications.

Column

Value

1

Sequence number

2

Timestamp of transaction

3

User, if provided by the database

4

Table name

5

Operation code – one of I (Insert), U (Update), or D (Delete)

6

Previous primary key value for the database column. This is used for update operations to identify the correct record

7

Next primary key value. This is used for update operations to identify the correct record in cases where the primary key is changed during the update operation

8

Database column value

Example File

Example filename: corp_20150115T161123155064_1.mir

10,2015-01-22 15:32:39.000000,,"glb","I",,2,"test insert"
11,2015-01-22 15:32:40.000000,," glb ","D",10,,
12,2015-01-22 15:32:42.000000,," glb ","U",104,104,"my updated"
13,2015-01-22 15:32:44.000000,," glb ","U",1,12,"sd"

Partitions for Refresh Replications

Use this dialog box to set up partitions for refresh replications to improve replication efficiency.

Refresh Partitions allow you to define multiple partitions or subsets within a replication during the refresh operation, so that multiple parallel threads move a specific subset of the data.

Syniti Replicate will create same number of files according the number of partitions defined in the source system, for example, if the user creates 10 partitions per table, Syniti Replicate will create 10 CSV or Parquet files into the directory folder in ADLS Gen2.

Data Type Mapping

When ADLS Gen2 is used as a target system, the mapping of Syniti Replicate data types to the corresponding data type in ADLS Gen2 is listed in the following table:

Syniti Replicate Mapping

Azure Data Lake

AutoDecimal

decimal

AutoInteger

integer

AutoNumeric

decimal

AutoSmallint

smallint

AutoTinyint

tinyint

Bigint

bigint

Binary

binary

Bit

bit

Blob

binary

Char

string

Clob

string

Date

date

Decimal

decimal

Double

double

Float

float

GUID

string

Integer

integer

LongVarbinary

binary

LongVarchar

string

Money

decimal

NChar

string

NText

string

Number

decimal

Numeric

decimal

NVarchar

string

Real

real

Smalldatetime

datetime

Smallint

smallint

Smallmoney

decimal

Sysname

string

Time

time

Timestamp

timestamp

TimestampZ

timestamptz

Tinyint

tinyint

Varbinary

binary

Varchar

string

Xml

string