Apache Parquet

Prev Next

Overview

This topic describes how to set up your Syniti Replicate environment to replicate data to Apache Parquet files. When replicating from a relational database to a Parquet file, Syniti Replicate allows you to replicate data in two different replication modes:

  • Refresh: A one-time complete replication from source table to a Parquet file, according to replication settings and scripts. You can schedule the replication, identify the columns to be replicated, and add scripts to transform data during replication.

  • One-Way Mirroring: A continuous update of a replicated table to a Parquet file based on changes to the source database that have been recorded in the database server log. You can define the replication settings to check the transaction log on the source database at regular intervals. Any changes found in the log would be applied to the target file.

Prerequisites

For Parquet Serialization, complete the following steps:

  1. If you haven't installed NuGet, download the Windows x86 Commandline version from the Nuget site and move it to a required directory. For example, D:\Temp\nuget.exe.

  2. Create a local directory to download the required NuGet packages. For example, D:\Temp\Parquet.

  3. From the command prompt with admin permissions, execute the below command. Make sure to execute this command from the right directory.

    D:\Temp\nuget.exe install Parquet.Net -Version 4.10.1 -Framework net480 –OutputDirectory D:\Temp\Parquet

    This command will download the entire provider into the specified output directory.

  4. Copy the below files from the downloaded packages and paste it into Replicate’s Plugins\ParquetSerializer directory. For example, D:\Program Files\Syniti\Syniti Replicate\Plugins\ParquetSerializer.

    • D:\Temp\Parquet\Apache.Arrow.2.0.0\lib\netcoreapp2.1\Apache.Arrow.dll

    • D:\Temp\Parquet\IronCompress.1.3.0\lib\netstandard2.0\IronCompress.dll

    • D:\Temp\Parquet\Microsoft.Data.Analysis.0.20.1\lib\netstandard2.0\Microsoft.Data.Analysis.dll

    • D:\Temp\Parquet\Microsoft.IO.RecyclableMemoryStream.2.3.2\lib\netstandard2.0 \Microsoft.IO.RecyclableMemoryStream.dll

    • D:\Temp\Parquet\Microsoft.ML.DataView.2.0.1\lib\netstandard2.0\Microsoft.ML.DataView.dll

    • D:\Temp\Parquet\Parquet.Net.4.10.1\lib\netstandard2.0\Parquet.dll

    • D:\Temp\Parquet\Snappier.1.0.0\lib\netstandard2.0\Snappier.dll

    • D:\Temp\Parquet\System.Buffers.4.5.1\lib\netstandard2.0\System.Buffers.dll

    • D:\Temp\Parquet\System.Collections.Immutable.1.5.0\lib\netstandard2.0\System.Collections.Immutable.dll

    • D:\Temp\Parquet\System.Memory.4.5.5\lib\netstandard2.0\System.Memory.dll

    • D:\Temp\Parquet\System.Numerics.Vectors.4.5.0\lib\netstandard2.0\System.Numeric.Vectors.dll

    • D:\Temp\Parquet\System.Runtime.CompilerServices.Unsafe.6.0.0\lib\netstandard2.0\System.Runtime.CompilerServices.Unsafe.dll

    • D:\Temp\Parquet\System.Threading.Tasks.Extensions.4.5.4\lib\netstandard2.0\System.Threading.Tasks.Extensions.dll

      Note
      It is possible that the packages are a higher version than mentioned above. For example, instead of System.Threading.Tasks.Extensions.4.5.4, the package can be System.Threading.Tasks.Extensions.4.5.5. In this case, copy the version that has been downloaded.

Replicating Data from a Relational Database to Parquet

1. Set Up a Source Connection to a Relational Database

  1. Make sure you set up database connections via a .NET data provider to your source database. For each database you are planning to use in your replication project, complete the following steps:

    • Install and configure one of the supported data access products. See the list of supported .NET data providers in the knowledge base for a current list of supported providers.

    • From the data access product, test the connection to the database.

    • Create a connection string for the data access product/database you are using. Check the documentation for the data access product for information on how to do this.

    • Check that the user ID you are planning to use has sufficient permissions to complete all operations in Syniti Replicate. Contact the technical support team via the Help Center for specific requirements for your database.

  2. Open the Syniti Replicate: Management Center.
    Syniti Replicate provides a default database (Microsoft SQL Server CE) for your metadata, all the information that Syniti Replicate needs to store about your replication setup.

  3. In the Metadata Explorer, expand the metadata node to view the Sources and Targets nodes.

  4. Select the Sources node.

  5. Right-click and select Add New Connection.

  6. In the Add Source Connection Wizard, follow the initial steps to add a connection string and test the connection to the database.

  7. On the Set Connection String page, complete the Connection Properties field.

    Note

    • You must edit the Required connection properties by clicking in the property value field and typing a new value.

    • If you are using an Oracle 11 client, in the Data Source field, make sure to add the Oracle Service Name after the IP address.

  8. Select the tables that you want to replicate.

  9. On the Actions page, select the Launch the Enable Transactional Setup Wizard checkbox.

  10. Complete the wizard. Refer to Add Source Connection Wizard for more information.

Enable Transactional Replication

This section assumes you are planning to replicate data to Parquet using mirroring (rather than Refresh) from a relational database. For Refresh replications, you do not need to complete the Enable Transactional Replication wizard. If you have checked the Source Connection wizard option to launch the Enable Transactional Replication wizard, follow the steps below. To open the wizard from the Management Center, choose the connection in the Metadata Explorer, then right-click to choose Transactional Setup > Enable...

In the Enable Transactional Replication wizard, complete the following steps:

  1. Select the type of transactional replication to use. The options depend on the source database and can include log reader, log server agent, triggers, and plus log reader API (for IBM Db2 for i only). Refer to Choosing a Log Type for Transactional Replications for more information on the best option.

  2. Click Next to enter the log settings. The fields and appropriate values depend on the database and log type. You can obtain a setup topic’s link or guide for a specific relational database by making a request in the Help Center.

  3. Click Next to verify your settings against the source connection to the database. If any information is missing, you will not be able to proceed.

  4. On the Actions page, select the Add Target Connection wizard check box to launch the Add Target Connection Wizard.

  5. Click Next to review your changes.

  6. Click Finish to complete the wizard.

The source connection is now set up for transactional replications.

2. Set up a Target Connection for Parquet

  1. In the Metadata Explorer, select the Targets node.

  2. Right-click and select Add New Connection.

  3. In the Add Target Connection Wizard, select Files - Parquet from the Database list.

  4. Click Next.

  5. On the Set Connection String page, set a directory path for the output from replications between the source relational database and the Parquet file target.

  6. Complete the connection properties using the information in the following table:

    Property

    Description

    Output Folder

    The schema name and location to the Parquet files. Set an Output Folder available to the system where Syniti Replicate is running.

    Add transactional info

    Make sure that the Add Transactional Info field is set to Yes, at the beginning.

    Compression

    The compression method desired, Currently supported 0 - None, 1 – Gzip, 2 – Snappy

    • 0: None for no compression. This is the fastest way to write files, however they may end up slightly larger.

    • 1: Gzip, using gzip compression, is the slowest, but should produce the best results if maximum compression is your top priority.

    • 2: Snappy is the default level and is a perfect balance between compression and speed.

    Use Nullable Fields

    Set to 1 (True) to allow nullable fields in Parquet files.

    Use DateTime Fields

    Set to 1 (True) to allow datetime fields in Parquet files. If 0 (False) is selected, Strings are used instead of datetime values.

    Use One File Per Group

    Use just one file per each group.

  7. Click Next to display the Select tables page.
    At this point, there is no output structure available to display. You can add the information after completing the Target Connection wizard.

  8. Click Next to display the Summary page.

  9. Click Finish to complete the wizard.

3. Add Table Information to the Target Connection

The target connection is displayed in the Metadata Explorer, but you still need to add the Parquet file representation for source table data so that when you create replications below, you can specify a source table and target file.

  1. In Metadata Explorer, expand the source connection you created above.

  2. Select and drag a source table to the target file connection.

  3. The Create Target Table wizard is displayed. Although you are not actually creating tables, you can use this wizard to create a representation of the file data.

  4. On the Source Connection page, you should see the source table you selected above.

  5. Click Next to display the Target Connection page.

  6. Verify that the page displays the correct target connection name and table name.

  7. Click Next to review the table structure.
    At this point, you can modify data types, null values and so on, if you want to modify the data eventually sent to a file.

  8. Click Next to display the SQL Script page.
    The contents of this page are inactive because there is no editable SQL script to create a table. Instead, Syniti Replicate outputs the table information to a file.

  9. Click Next to display the Summary page.

  10. Click Finish to create the table representation in the Metadata Explorer.

  11. Click Yes to add the table name to the target connection entry in the Metadata Explorer.

  12. Repeat Steps 2 through 11 for each source table that you want to replicate to a file.

At the end of this process, you should have a list of table representations under the target connection in the Metadata Explorer.

4. Define Replications

  1. Expand the Metadata Explorer tree to display the table that contains the data you want to replicate.

  2. Select a table.

  3. Right-click the table and select Replication, and then select Create New Replication.

  4. On the Define Replication Type page, enter a name to identify the replication.

  5. Optionally provide a description of the replication.

  6. In the Replication Mode section, select Mirroring to capture changed data only.

  7. Click Next to move on to the Select Source Connection page.

  8. Select a source connection name from the list that includes all the source connections you have created in Syniti Replicate.

  9. Select the tables that you want to replicate from the list.

  10. If you want more information about the table before proceeding, click Open Table.

  11. Click Next to move on to the Source Log Info page.
    The first two fields on this page depend on the source database you are using. In this case, we assume you are using Microsoft SQL Server. Refer to Replication Wizard for more information on the other databases.

  12. In the Transaction ID field, click Read to open the Read Point dialog box.

    Select the current transaction or a transaction ID based on a specific time or date.

  13. Click OK to add the value to the Source Log Info page.

  14. Set the value of the Read Interval field to the frequency with which you want Syniti Replicate to check the transaction log for new events to mirror.

  15. Click Next to move on to the Select Target Connection screen.

  16. Select the target connection for text output from the list that includes all the target connections you have created in Syniti Replicate.

  17. Select the data set you want to replicate from the list.
    If the list is empty, exit the wizard and add or create a target data set.

  18. Click Next to move on to the Set Mapping Info page.
    Source columns and target data with the same name are automatically mapped.

  19. Click Next to move on to the Scheduling page.

  20. Make sure the Enable Replication checkbox is selected. This checkbox must be selected for the replication to run.

  21. Set a start time for the replication. The Start Time field indicates the time at which the Replication Agent will begin considering the replication for execution.

  22. Select Execute Initial Refresh.
    A full replication will be performed from the source table to the data file, prior to starting the mirroring process where only incremental changes will be replicated.

  23. Access the Mirroring Schedule tab and select how you want to run the replication:

    • Run Continuously: The transaction log will be verified for changes to the table using the frequency that you specified on the Set Log Info tab. Any changes will be replicated to the target table.

    • Schedule Interruptions: The replication process will run as above, except for interruptions specified when you click the Schedule button in the Scheduler dialog box.

  24. Click Next to move on to the Summary page.

  25. Click Finish to complete the wizard.

5. Run the Replication

If you have installed The Replication Agent as a service during setup, you need to start the service using the Service Monitor program DBMotoServiceMonitorReplicatorOff.png in the Windows Notification Area. The replication that you have scheduled should start at the specified time.

If you would like to install the Replication Agent as a service:

  1. From the Windows Desktop Start menu, select Syniti Replicate, and then select Service Installer.

  2. Manage the service from Service Monitor program located in the Syniti Replicate install folder or on the Windows Start > Programs > Startup menu.

  3. In the Management Center, access the Replication Monitor tab to track the progress of the replication.

To run the Replication Agent interactively:

  1. In Windows Task bar > Notification section, right-click the Service Monitor DBMotoServiceMonitorReplicatorOff.png icon.

  2. Select Replication Agent > Start > Application. The replication that you have scheduled should start at the specified time.

  3. In the Management Center, access the Replication Monitor tab to track the progress of the replication.

Parquet File Structure

The file format generated by the process will have a .parquet extension. In addition, files used as a replication target have the following structure. The file has a .mir.parquet extension when used for mirroring, and a .ref.parquet extension when used for snapshot or refresh replications.

Column

Value

1

Sequence number

2

Timestamp of transaction

3

User, if provided by the database

4

Table name

5

Operation code:

  • I: Insert

  • UB: Update Before Image

  • UA: Update After Image

  • D: Delete

6

Record image