-
What size of data are you storing, how often will the data arrive, and is the data coming from within Azure, for example from Azure Storage or Event Hubs?
Understanding the data volumes, speed of arrival and rate of change is important in deciding how to handle and store the data for processing. Understanding whether this is streaming data or a batch ingest will determine whether it has already been processed or might still require transformation, which could also be achieved at runtime.
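For example, if events are arriving from within Azure as a stream, a minimal ingestion sketch along these lines (assuming the azure-eventhub Python package, with placeholder connection details that are not from the original text) can help frame the discussion:

```python
from azure.eventhub import EventHubProducerClient, EventData

# Placeholder connection details -- substitute your own namespace and hub name.
CONNECTION_STR = "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=...;SharedAccessKey=..."
EVENTHUB_NAME = "telemetry"

def send_sample_events(readings):
    """Send a small batch of JSON readings to Event Hubs for downstream processing."""
    producer = EventHubProducerClient.from_connection_string(
        conn_str=CONNECTION_STR, eventhub_name=EVENTHUB_NAME
    )
    try:
        batch = producer.create_batch()
        for reading in readings:
            batch.add(EventData(reading))
        producer.send_batch(batch)
    finally:
        producer.close()

send_sample_events(['{"deviceId": "dev-1", "temperature": 21.5}',
                    '{"deviceId": "dev-2", "temperature": 19.8}'])
```

Whether events trickle in like this or land as large files in Azure Storage changes how much buffering, partitioning and transformation the ingestion layer needs to do.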
-
How will the data be formatted when it leaves the source system, and how do you plan to consume that data once it is in Azure?
Understanding how the data will be used and what format it will arrive in is critical to defining what processing needs to take place, or what schema-on-read resources might be needed to facilitate data access.
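As a sketch of what schema-on-read can look like in practice, the example below assumes PySpark and a hypothetical JSON landing folder (the path, column names and types are illustrative only): the schema is supplied when the data is read, not when it is written.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("schema-on-read-example").getOrCreate()

# Illustrative schema for raw telemetry landed as JSON files in a storage container.
schema = StructType([
    StructField("deviceId", StringType(), nullable=False),
    StructField("temperature", DoubleType(), nullable=True),
    StructField("eventTime", TimestampType(), nullable=True),
])

# The schema is applied at read time; the files stay in their raw source format.
raw = spark.read.schema(schema).json(
    "abfss://raw@<storageaccount>.dfs.core.windows.net/telemetry/"
)
raw.printSchema()
```

If the source format is awkward to query directly, this read step is also the natural place to convert the data into something columnar such as Parquet.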
-
How will the data be consumed? Is the expectation that there will be some type of interactive query capability, or will the data be batch processed?
Understanding how the data will be accessed is critical to the choice of big data technology. Interactive query points towards technologies such as HDInsight LLAP (Interactive Hive), Azure SQL Data Warehouse or even Spark/Databricks. Those can also work for batch, but Hive, MapReduce and similar technologies can also help with batch scenarios.
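To make the interactive-versus-batch distinction concrete, here is a rough PySpark sketch (paths, view name and columns are hypothetical): the same dataset is explored ad hoc through Spark SQL and also aggregated and persisted as a batch output for downstream consumers.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("interactive-vs-batch").getOrCreate()

# Hypothetical curated dataset produced by an earlier ingestion step.
readings = spark.read.parquet("/mnt/curated/telemetry/")

# Interactive-style access: register a view and run ad hoc queries against it.
readings.createOrReplaceTempView("telemetry")
spark.sql(
    "SELECT deviceId, AVG(temperature) AS avg_temp FROM telemetry GROUP BY deviceId"
).show()

# Batch-style access: the same aggregation computed and written out on a schedule.
daily = (
    readings
    .groupBy(F.to_date("eventTime").alias("day"), "deviceId")
    .agg(F.avg("temperature").alias("avg_temp"))
)
daily.write.mode("overwrite").parquet("/mnt/serving/daily_temperature/")
```

The interactive path favours engines that keep data warm and answer in seconds, while the batch path tolerates longer-running jobs that write results for later consumption.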
-
How sensitive is the data being stored, what retention policies are you using, and what access restrictions does the data need?
Understanding security and compliance requirements is critical when storing potentially sensitive customer data. Does the company have permission to hold the data? Is the data sensitive, and does it therefore require auditing or restricted access? Will the resulting output of the data processing be accessible to everyone or not?
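As one illustration of restricting access to processed output, the sketch below assumes the azure-storage-blob Python package and placeholder account details: it issues a time-limited, read-only link to a single blob rather than granting broad access to the container.

```python
from datetime import datetime, timedelta, timezone
from azure.storage.blob import generate_blob_sas, BlobSasPermissions

# Placeholder account details -- not real values.
ACCOUNT_NAME = "examplestorageacct"
ACCOUNT_KEY = "<account-key>"

def read_only_link(container: str, blob: str, hours: int = 4) -> str:
    """Return a read-only URL for a single blob that expires after a few hours."""
    sas = generate_blob_sas(
        account_name=ACCOUNT_NAME,
        container_name=container,
        blob_name=blob,
        account_key=ACCOUNT_KEY,
        permission=BlobSasPermissions(read=True),
        expiry=datetime.now(timezone.utc) + timedelta(hours=hours),
    )
    return f"https://{ACCOUNT_NAME}.blob.core.windows.net/{container}/{blob}?{sas}"

print(read_only_link("reports", "2024/summary.parquet"))
```

Retention and audit requirements would typically be layered on top of this, for example through storage lifecycle policies and diagnostic logging, rather than handled in application code.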