TL;DL – .NET Rocks! 1370 (Data Lakes with Michael Rys)

Show link: .NET Rocks! 1370: Data Lakes with Michael Rys

Guest: Michael Rys (pronounced “riss”)

Show Notes

  • What is a data lake?
    • It’s a new-ish approach to analytics. A traditional data warehouse starts with a schema, and you then have to process your data to fit that schema; however, you typically find later on that the schema is insufficient for future needs.
    • The philosophy involves a highly scalable store where you keep the data (e.g., log files, images, CSV files, a database itself) in its original format. Data scientists can then explore the original data. (You can still schematize if needed.)
    • See: Azure Data Lake
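    • A minimal U-SQL sketch of the schema-on-read idea (the file path and columns below are hypothetical): the raw file stays in the lake as-is, and a schema is only imposed when a query reads it.

      // Impose a schema at read time; the stored file itself is untouched.
      @events =
          EXTRACT Timestamp DateTime,
                  UserId    string,
                  Action    string
          FROM "/raw/clickstream/2016-06-01.csv"
          USING Extractors.Csv();

      // Explore the raw data without any prior ETL.
      @summary =
          SELECT Action, COUNT(*) AS Hits
          FROM @events
          GROUP BY Action;

      OUTPUT @summary
      TO "/exploration/action-counts.csv"
      USING Outputters.Csv();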
  • Do you have to define what fields mean?
    • It depends on the original data; the more information you can provide, the easier it is to schematize the data and pose queries over it.
    • Example: image files. Your first query may involve finding colors (e.g., which image is blue), but future queries may involve semantics (e.g., which image contains a car).
    • Doing the feature extraction a priori means you lose data.
  • The data in lakes is pre-ETL; the data prep process inside the analytics component does the case-by-case transformation that the ETL process would normally do up front.
  • A document database at least has a group of similar docs; a data lake could have all sorts of things/files.
    • Yes. Most of the time you have an intrinsic structure that you can design a query around (e.g., XQuery); in U-SQL parlance you build extractors.
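    • A hedged sketch (class, column, and file names are made up) of what a user-defined extractor can look like in a U-SQL script’s C# code-behind; an extractor turns the raw bytes of a file into rows the query can work over:

      using System.Collections.Generic;
      using System.IO;
      using Microsoft.Analytics.Interfaces;

      [SqlUserDefinedExtractor]
      public class LineExtractor : IExtractor
      {
          // Emit one row per input line: the raw text plus its length.
          public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
          {
              using (var reader = new StreamReader(input.BaseStream))
              {
                  string line;
                  while ((line = reader.ReadLine()) != null)
                  {
                      output.Set<string>("line", line);
                      output.Set<int>("length", line.Length);
                      yield return output.AsReadOnly();
                  }
              }
          }
      }

      // Used from the script roughly like this:
      // @rows = EXTRACT line string, length int
      //         FROM "/raw/some-log.txt"
      //         USING new LineExtractor();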
  • Is there a typical way you organize the data lake itself?
    • Most likely, yes. For example, log files can be organized by folder structure (e.g., time, cluster name, market, language), which you can exploit in queries using a file set.
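    • A sketch of exploiting such a folder layout with a U-SQL file set (the layout and names below are invented): the path pattern turns folder and file-name parts into virtual columns, and filtering on those columns only touches the matching files.

      @lines =
          EXTRACT Message string,
                  date    DateTime,   // virtual column from the folder structure
                  market  string      // virtual column from the file name
          FROM "/logs/{date:yyyy}/{date:MM}/{date:dd}/{market}.log"
          USING Extractors.Text();

      // Only folders from May 2016 onward, and only the en-us files, are read.
      @recent =
          SELECT Message, market
          FROM @lines
          WHERE date >= DateTime.Parse("2016-05-01") AND market == "en-us";

      OUTPUT @recent TO "/exploration/recent-en-us.txt" USING Outputters.Text();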
  • Is there such a thing as an index in a data lake?
    • Yes, depending on what your query language supports. You get standard clustered indices after you “cook the data” (i.e., take the original data in the lake, schematize it, clean it up, and persist a copy once you feel the data is in a format you want to keep). You can virtualize the cooked data so that your scale-out processing gets additional benefits such as partition elimination or filter push-downs.
  • Do we have a recipe for the cooking?
    • The language and tooling that Azure Data Lake provides as part of the analytics framework (U-SQL). Microsoft also has best practices around this (e.g., if you have point lookup queries you might want a hash partition; if you have range queries you use range partitioning).
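    • A rough sketch of persisting the “cooked” data as a U-SQL table (names are illustrative, and the exact DDL has shifted across preview releases): the clustered index plus the distribution scheme are what later enable partition elimination and filter push-downs; HASH suits point lookups, RANGE suits range queries.

      @cleaned =
          EXTRACT EventDate DateTime,
                  Market    string,
                  Hits      long
          FROM "/staging/cleaned-events.tsv"
          USING Extractors.Tsv();

      CREATE DATABASE IF NOT EXISTS Telemetry;

      CREATE TABLE IF NOT EXISTS Telemetry.dbo.CookedEvents
      (
          EventDate DateTime,
          Market    string,
          Hits      long,
          INDEX idx_cooked CLUSTERED (EventDate, Market)
          // RANGE (EventDate) would suit range-heavy queries instead of HASH
          DISTRIBUTED BY HASH (Market)
      );

      INSERT INTO Telemetry.dbo.CookedEvents
      SELECT EventDate, Market, Hits
      FROM @cleaned;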
  • After the cooking, you’re assuming it will be optimized.
    • It depends on whether you have a few documents or many, as these will be partitioned to scale out the processing.
  • Do I run more Azure instances for a shorter period, or fewer for longer?
    • An Azure Data Lake Analytics Unit (AU) represents one degree of parallelism; you pay the hourly rate for each AU allocated for the duration of the job.
    • There are tools to show how the job would execute as you adjust the allocated resources.
    • Scaling the map side out further usually isn’t a big deal, but the reduce side gets to be high overhead.
  • Are there case studies in terms of how much data was stored and how long it took to do a query?
    • In a demo over Azure telemetry data: 60,000 vertices (units of work) reading/writing ~55 TB took 4.5 hours on 200 nodes; the equivalent CPU time was 770 hours.
  • U-SQL is like SQL?
    • The “U” comes from combining SQL with an internal big data project (“Cosmos”); the goal was to give people an easy-to-use declarative language that is scalable and optimizable.
    • Apache Hive and Spark have something similar.
    • SQL is a good language for expressing sets.
    • The language is built for extensibility: its type system is C#, and Roslyn is used under the hood.
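    • A small illustration of the SQL-plus-C# mix (file and column names are invented): the declarative shape is SQL, while the expressions inside it are ordinary C# over C# types.

      @searchlog =
          EXTRACT UserId int,
                  Start  DateTime,
                  Query  string
          FROM "/raw/searchlog.tsv"
          USING Extractors.Tsv();

      @shaped =
          SELECT UserId,
                 Start.DayOfWeek.ToString() AS Day,                      // C# property and method calls
                 Query.ToLowerInvariant() AS NormalizedQuery,
                 (Query.Length > 20 ? "long" : "short") AS QueryClass    // C# conditional expression
          FROM @searchlog;

      OUTPUT @shaped TO "/cooked/searchlog-shaped.tsv" USING Outputters.Tsv();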
  • Are data lakes a good fit for Internet of Things (IoT) projects?
    • You can use a lambda architecture with a hot path (i.e., what’s happening in real time) and a cold path (i.e., events stored in the data lake).
    • The stream analytics are done on a longer time scale (e.g., hourly, weekly).
    • You can set up the pipeline so you only put data into the lake if you have a change from the last entry (i.e., delta). Otherwise, you can end up paying for unnecessary storage.
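    • The point above is about filtering before ingestion (e.g., in the event pipeline), but as a hedged sketch the same “keep only the deltas” idea can also be expressed over readings that have already landed in the lake (all names below are hypothetical):

      @readings =
          EXTRACT DeviceId string,
                  Ts       DateTime,
                  Value    double?
          FROM "/iot/raw/readings.csv"
          USING Extractors.Csv();

      // Compare each reading with the previous one per device...
      @withPrev =
          SELECT DeviceId, Ts, Value,
                 LAG(Value) OVER (PARTITION BY DeviceId ORDER BY Ts) AS PrevValue
          FROM @readings;

      // ...and keep only the first reading and the rows where the value changed.
      @deltas =
          SELECT DeviceId, Ts, Value
          FROM @withPrev
          WHERE PrevValue == null || Value != PrevValue;

      OUTPUT @deltas TO "/iot/deltas/readings.csv" USING Outputters.Csv();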
  • Does data just stay in the lake for a certain period of time?
    • Yes. Azure Data Lake Analytics only offers batch mode, but with other engines over the lake (e.g., HDInsight) you may just keep the data in the lake itself.
    • In other cases you may want to move the data into a SQL database or data warehouse for doing final reporting.
  • Why would you pick Hadoop to extract the data?
    • It depends on the expertise of your people (e.g., Hadoop is Java-based); use U-SQL if you have good SQL and .NET knowledge. Either way, you’re operating on the same lake, so use the tooling you’re most familiar with.
    • HDInsight is a cluster service, which you pay for even if you’re not doing any processing on it; also, it can’t scale on-demand as much. Azure Data Lake is job-based, pay as you go, and scale as you go.
  • Is Azure Data Lake storage more economical than storage in a SQL Azure database?
    • This is comparing apples to oranges, as you’re paying for processing that comes with the database.
    • You need to look at the data storage and the compute costs over time.
    • A document database (or Azure Blob storage) is sort of in between: it has good scale-out, but a more limited query and storage model.
    • Azure Data Lakes can access data in Blob storage and SQL Azure databases; Blob storage does not have a file API and its security model is not integrated with Azure Active Directory.
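    • Reading Blob-hosted data from U-SQL can look roughly like this (the account and container names are placeholders, and the storage account has to be registered as a data source for the Data Lake Analytics account):

      // A wasb:// path points at Azure Blob storage instead of the lake's own store.
      @blobRows =
          EXTRACT Name  string,
                  Value int
          FROM "wasb://mycontainer@mystorageaccount.blob.core.windows.net/data/sample.csv"
          USING Extractors.Csv();

      OUTPUT @blobRows TO "/from-blob/sample.csv" USING Outputters.Csv();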
  • What’s the relationship between the analytics part and Azure Machine Learning and predictive analytics?
    • Microsoft is currently working on this. Azure Machine Learning is its own service, so you can have data in the lake that would feed into the model generation.
    • You can use the Azure Data Factory to do data movement.
    • Eventually this functionality could be accessible through the U-SQL script.
  • The Microsoft Research project Dryad didn’t gain a lot of traction because MapReduce was so popular in the industry (even though Dryad did much more). Apache Spark is inspired by Dryad. Azure Data Lake Analytics uses YARN as the resource manager, and the execution is Dryad-based.
  • “Why can’t I connect to external web sources from my C# code?” Just imagine that you have several million rows and you do an IP lookup via a web service, and you do that scaled out on 1000 nodes within Azure Data Lake. That lookup resource would probably end up blocking the Azure IP range, as it would look like a DDoS attack.
  • What are the data loading mechanisms available to fill your data lake?
  • When data sets get really big, are there physical transport methods?
    • Ship physical disks.
    • Obtain a fast direct connection (i.e., a fiber connection into the data center directly).
    • Use a PowerShell upload script which parallelizes the operation by separating file extents.
    • Pay for orchestration; use Azure Data Factory for dealing with on- and off-premises instances and other clouds.
  • What’s next for Azure Data Lakes?
    • It’s currently in public preview with all components except HDInsight (which is already GA); there will be GA in US data centers before the end of 2016.
    • They will be available in non-US data centers by early 2017. (The Europeans typically don’t want their data on US servers.)
    • The team is adding new functionality based on user feedback (e.g., better streaming interaction).
    • There is interest in making some U-SQL clauses optional.
    • There is a goal to make it easier for people to write assemblies in other languages (e.g., JVM, Python, JavaScript).

Better Know a Framework

Listener E-mail

From show #1327 (R for the .NET Developer with Jamie Dixon and Evelina Gabasova): a massive SQL procedure written in an object-oriented fashion took 16 hours to run, but when optimized to be set-based it took 2 hours.

Technology Giveaway Ideas