Azure Data Migration Strategy

The options listed below may be considered based on specific data processing and analytics needs:

Azure HDInsight suits enterprises that want to run both Hadoop and Spark while keeping management simple across big data workloads.

Advantages

  • Highly scalable and available
  • Strong backup and disaster recovery facilities
  • Simplified cluster creation and deletion
  • Easy data management, with data available for retrieval at any time
  • Cheaper than on-premises Hadoop, and cost-effective for collecting and storing structured or unstructured data
  • 99.9% SLA at large scale, with excellent support from Microsoft
  • More than 30 popular applications to choose from, deployable to the cluster within minutes

Limitations

  • Overall cost remains relatively high, even though it is cost-effective compared to on-premises Hadoop
  • Lack of integration with other Azure platforms
  • Spark version is outdated
  • Requires Azure expertise to handle errors and adapt applications
  • Some customers report performance issues when dealing with large data volumes

Azure Synapse offers data integration, big data analytics, and enterprise data warehousing through a single unified analytics platform.

Advantages

  • Can be easily provisioned with an existing Azure subscription and offers pay-as-you-go pricing
  • Integration with Azure Active Directory and Azure Purview provides an easy way to manage user roles and gain insights into data
  • Knowledge transfers readily from an on-premises Microsoft SQL Server background

Limitations

  • Managing high volumes of concurrent queries is difficult because of the tuning required and the cost of higher service tiers
  • Requires complex database administration tasks, including performance tuning, which other cloud data solutions have made more turnkey
  • Serverless capabilities are limited to newer Azure services, and on-demand, frictionless sizing of compute is lacking within Azure DW (Data Warehouse)

Azure Machine Learning is designed to help data scientists and developers quickly build, deploy, and manage models via machine learning operations (MLOps), open-source interoperability, and integrated tools.

Advantages

  • No limit on the amount of data pulled from Azure storage and HDFS
  • Easy-to-use set of tools that is less restrictive about the quality of the training data
  • Easy to import training data and then tune the results
  • Lower maintenance cost than on-premises analytics solutions
  • Built-in R module, support for Python, and options for custom R code for extensibility
  • Security for Azure ML Service relies on Azure security measures

Limitations

  • Fewer algorithms and transformations are available, so users may need to implement custom solutions for specific algorithms
  • Expensive for large-scale ML projects
  • Complex environment for beginners; the UI takes time to understand

Azure Data Factory is a cloud-based data integration service for creating data-driven workflows that orchestrate and automate data movement and data transformation.

Advantages

  • Serverless solution reduces tedious tasks and maintenance
  • Seamless integration with third-party connectors
  • Easy to create pipeline schedules and execute SSIS packages
  • Copies data from sources such as JSON files, Azure databases, APIs, and Azure Synapse
  • A linked service can be reused across many pipelines, with fast and convenient connectors
  • Accessible to multiple users from remote locations

Limitations

  • Building custom activities and integrations is not included within the service, so users rely on Azure itself for new features
  • Integration with non-Azure services is limited. Although connectors are provided for offerings from other CSPs, such as Amazon S3 and Google BigQuery, standalone ADF would not stand out as our tool of choice in a multi-cloud strategy
  • The ADF interface's tab for monitoring pipelines and activities is table-based and does not offer as good an overview as Airflow
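
To make the idea of a data-driven workflow more concrete, here is a minimal sketch in Python of what a pipeline definition with a single copy step might look like. It builds a JSON document in the general shape that Data Factory pipeline definitions follow: a named pipeline holding a list of activities, each referencing input and output datasets. The helper make_copy_pipeline and the pipeline, activity, and dataset names are all hypothetical, and the source and sink type strings are illustrative assumptions that should be checked against the connector documentation before use.

```python
import json


def make_copy_pipeline(name, source_dataset, sink_dataset,
                       source_type="JsonSource", sink_type="SqlDWSink"):
    """Return a dict shaped like a Data Factory pipeline with one Copy activity.

    All names are placeholders; source_type and sink_type are assumed values.
    """
    copy_activity = {
        "name": f"Copy_{source_dataset}_to_{sink_dataset}",
        "type": "Copy",
        # Datasets are referenced by name; they would be defined separately.
        "inputs": [{"referenceName": source_dataset, "type": "DatasetReference"}],
        "outputs": [{"referenceName": sink_dataset, "type": "DatasetReference"}],
        "typeProperties": {
            "source": {"type": source_type},
            "sink": {"type": sink_type},
        },
    }
    return {"name": name, "properties": {"activities": [copy_activity]}}


# Hypothetical pipeline and dataset names, for illustration only.
pipeline = make_copy_pipeline("DailyOrdersLoad", "OrdersJson", "OrdersWarehouse")
pipeline_json = json.dumps(pipeline, indent=2)
print(pipeline_json)
```

A definition like this would normally be deployed through the Azure portal, a template, or an SDK rather than run directly; the sketch only shows the declarative shape of a pipeline.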

Azure Stream Analytics is a fully managed, serverless engine by Microsoft for real-time analytics on multiple data streams from sources such as sensors, web data sources, social media, and other applications.

Advantages

  • Better suited for simpler data processing scenarios that require real-time analytics and insights
  • Optimized for streaming analytics and can process large volumes of data in real time
  • Can integrate with other Azure services such as Azure Cognitive Services, Azure Storage, or Azure Functions

Limitations

  • Supports only SQL for queries
  • Input data must be in Avro, JSON, or CSV format
  • Static data can only be added from blob storage
  • Can only integrate with Azure services
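
Since queries are limited to SQL, it helps to see what a typical streaming aggregation actually computes. The sketch below, written in plain Python rather than in the Stream Analytics query language, averages sensor readings per device over fixed, non-overlapping time windows (often called tumbling windows). The function name, the event tuple layout, and the sample readings are hypothetical. Windows here are labeled by their start time and treated as half-open intervals, which is an assumption and may not match the service's own boundary conventions.

```python
from collections import defaultdict


def tumbling_window_average(events, window_seconds):
    """Average values per (key, window start) over fixed, non-overlapping windows.

    events: iterable of (timestamp_seconds, key, value) tuples.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for timestamp, key, value in events:
        # Integer division assigns each event to exactly one window.
        window_start = (timestamp // window_seconds) * window_seconds
        sums[(key, window_start)] += value
        counts[(key, window_start)] += 1
    return {bucket: sums[bucket] / counts[bucket] for bucket in sums}


# Hypothetical sensor readings: (timestamp in seconds, device id, value).
readings = [
    (0, "sensor-a", 10.0),
    (10, "sensor-a", 20.0),
    (31, "sensor-a", 30.0),
    (5, "sensor-b", 4.0),
]
averages = tumbling_window_average(readings, window_seconds=30)
for (device, start), avg in sorted(averages.items()):
    print(device, start, avg)
```

With 30-second windows, the first two sensor-a readings fall into the window starting at 0 and the third into the window starting at 30, so each device and window pair gets its own average.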

Case Studies