Every machine learning project requires a deep understanding of the data: whether it is representative of the problem to be solved, which approaches should be taken and, ultimately, whether the project can succeed.
This understanding is typically built during the Exploratory Data Analysis (EDA) phase. It is a complex part of the project in which the data is cleansed, outliers are identified and the suitability of the data is assessed to inform hypotheses and experiments.
The following image illustrates the various phases, their respective complexity and roles during a typical machine learning project:
The Data Discovery Playbook aims to quickly provide structured views on your text, images and videos, all at scale using Synapse and unsupervised ML techniques that exploit state of the art deep learning models.
The goal is to present this data to and facilitate discussion with a business user/data owner very quickly via PowerBI visualisation, so that the customer and team can decide the next best action with the data, identify outliers or generate a training data set for a supervised model.
Another goal is to simplify and accelerate the complex Exploratory Data Analysis phase of the project by democratising common data science functions, so that you can focus more on the business problem you are trying to solve.
Keep all code assets standalone and as simple as possible for quick usage or adaptation for production usage.
The aim of this Playbook is to illustrate the usage of the tools, alongside guidance, examples and documentation, to get rapid insights from your unstructured data. All of these have been applied in real customer solutions.
The intended audience of this Playbook includes:
This Playbook provides code to quickly discover data as part of the Exploratory Data Analysis phase of the project. The overall approach is to take a large unstructured dataset that has no labels available, and to iterate over the data using a variety of techniques to aggregate, cluster and ultimately label the data in a cost effective and timely manner.
This is achieved by using unsupervised ML clustering algorithms, heuristic approaches and by direct input and validation by a domain expert. Asking questions of the data in natural language is also possible, if text based, using the semantic search feature of Azure Cognitive Search.
By combining these approaches, structure and labels can be applied to large datasets so that the data may either be indexed for discovery via a search solution such as Azure Cognitive Search or for a supervised ML model to be trained so that future unseen data can be classified accordingly.
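As a minimal, dependency-free sketch of the clustering step described above, the following Python code builds TF-IDF vectors and groups documents with a simple spherical k-means. It is not the Playbook's notebook code: the function names (`tfidf_vectors`, `cluster`), the whitespace tokeniser, the farthest-first initialisation and the sample sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def normalise(vec):
    """Scale a sparse vector (dict of term -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def tfidf_vectors(docs):
    """Return one unit-length sparse TF-IDF vector per document."""
    tokenised = [doc.lower().split() for doc in docs]  # naive tokeniser (assumption)
    n_docs = len(tokenised)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vec = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        }
        vectors.append(normalise(vec))
    return vectors


def cosine(a, b):
    """Dot product of two sparse vectors (cosine similarity for unit vectors)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def cluster(vectors, k, iterations=10):
    """Assign each vector to one of k clusters and return the cluster labels."""
    # Deterministic farthest-first initialisation, so results are repeatable.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            min(vectors, key=lambda v: max(cosine(v, c) for c in centroids))
        )
    labels = []
    for _ in range(iterations):
        # Assignment step: pick the most similar centroid for every document.
        labels = [max(range(k), key=lambda i: cosine(v, centroids[i])) for v in vectors]
        # Update step: each centroid becomes the normalised sum of its members.
        for i in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == i]
            if members:
                total = Counter()
                for member in members:
                    total.update(member)
                centroids[i] = normalise(total)
    return labels


docs = [
    "football match goal striker",
    "football league goal keeper",
    "tennis serve racket court",
    "tennis court serve ace",
]
labels = cluster(tfidf_vectors(docs), k=2)
print(labels)
```

The resulting cluster ids can then be reviewed by a domain expert and mapped to meaningful labels, which is the iteration loop the approach relies on.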
The following illustrates this approach at a high level for a text based problem where large amounts of unstructured data exists:
PixPlotML is an interactive and zoomable visualization of your whole dataset. This web-based tool, a modified version of the original Pixplot, is valuable for object detection and classification projects to perform these tasks:
Images that look similar are located next to or near each other, making it easy to see where errors occur (in the UMAP visualization).
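The idea behind spotting errors this way can be sketched in a few lines of Python: if similar images sit close together, an image whose label disagrees with its nearest neighbours is worth a second look. This is only an illustration of the principle, not how PixPlotML works internally. The function name `suspicious_labels`, the 2D points standing in for image embeddings and the labels are all hypothetical.

```python
import math
from collections import Counter


def suspicious_labels(embeddings, labels, k=3):
    """Return indices whose label differs from the majority label of their k nearest neighbours."""
    flagged = []
    for i, point in enumerate(embeddings):
        others = [j for j in range(len(embeddings)) if j != i]
        # Sort every other point by Euclidean distance and keep the k closest.
        nearest = sorted(others, key=lambda j: math.dist(point, embeddings[j]))[:k]
        majority, _ = Counter(labels[j] for j in nearest).most_common(1)[0]
        if majority != labels[i]:
            flagged.append(i)
    return flagged


# Hypothetical 2D embeddings: two tight groups of four images each.
embeddings = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
]
# The image at index 3 sits in the "cat" group but is labelled "dog".
labels = ["cat", "cat", "cat", "dog", "dog", "dog", "dog", "dog"]

flagged = suspicious_labels(embeddings, labels)
print(flagged)
```

In practice the embeddings would come from an image model and have many more dimensions; the visual tool lets a person make the same judgement by eye.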
Hypothesis driven development and experiment tracking 🧪
Code written during EDA may not make it to production, but treating it as production code is vital: it provides an audit trail and represents the investment made to determine the correct ML solution as part of a hypothesis driven development approach.
This allows teams not only to reproduce the experiments but also to learn from past lessons, saving time and associated development costs.
All Synapse notebooks contain full AML and MLflow experiment tracking to provide lineage on the data and parameters used.
This Playbook aims to provide similar approaches across a variety of technologies and uses the following components:
A new Synapse workspace and all cluster configuration and notebooks can be deployed from here.
Download and install the Azure CLI
Download and install jq, a lightweight and flexible command-line JSON processor
Azure Data Lake Storage Gen2 storage account - The Azure Synapse workspace needs to be able to read and write to the selected ADLS Gen2 account. In addition, any storage account that you link as the primary storage account must have had hierarchical namespace enabled when it was created, as described on the Create a Storage Account page. More info on creating Azure Data Lake Storage can be found here. This existing or new account also needs a Blob Container, created with a name of your choice, which is supplied below in the environment variable BlobContainerName. For example, call your container share.
Log in to your Azure subscription via az login
Clone the Playbook repo:
    git clone https://github.com/microsoft/data-discovery-toolkit
    cd data-discovery-toolkit/environment_preparation/deployment
Rename the file vars.sample to vars.env
Populate the required variables within the vars.env file:
    # The resource group that your Synapse instance has been provisioned to
    SynapseResourceGroup=
    # The region of the Synapse Resource Group
    Region=
    # The existing ADLS Storage Account name
    StorageAccountName=
    # The existing resource group of the ADLS storage account
    StorageAccountResourceGroup=
    # The name of the existing Blob Container within the ADLS Gen 2 storage account mentioned above - also called File Share in some of the notebooks
    BlobContainerName=
    # The name of the Synapse Workspace
    SynapseWorkspaceName=
    # The Synapse SQL user
    SqlUser=
    # The Synapse SQL password
    SqlPassword=
    # The Azure subscription id
    SubscriptionId=
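Before starting a deployment that takes a while, it can help to check that no variable was left blank. The following Python sketch is not part of the repository: it assumes that all nine variables listed above are required, and the helper names (`parse_env`, `missing_variables`) and the sample values are hypothetical.

```python
# Assumption: every variable in vars.env must be populated before deployment.
REQUIRED = [
    "SynapseResourceGroup",
    "Region",
    "StorageAccountName",
    "StorageAccountResourceGroup",
    "BlobContainerName",
    "SynapseWorkspaceName",
    "SqlUser",
    "SqlPassword",
    "SubscriptionId",
]


def parse_env(text):
    """Parse KEY=VALUE lines into a dict, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_variables(text, required=REQUIRED):
    """Return the names of required variables that are absent or empty."""
    values = parse_env(text)
    return [name for name in required if not values.get(name)]


# Hypothetical file contents with two variables left unpopulated.
sample = """
# The resource group that your Synapse instance has been provisioned to
SynapseResourceGroup=example-rg
Region=westeurope
StorageAccountName=examplestorage
StorageAccountResourceGroup=example-rg
BlobContainerName=share
SynapseWorkspaceName=example-synapse
SqlUser=sqladmin
SqlPassword=
SubscriptionId=
"""
print(missing_variables(sample))
```

To check a real file, pass its contents instead of the sample string, for example the text read from vars.env in the deployment folder.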
☕ Grab some coffee as it will take around 30 minutes ☕
Data Discovery
To use the Synapse components, an Azure Synapse Spark pool is required. Please navigate to Synapse Environment Preparation to configure the cluster for usage.
To use the Azure Cognitive Search features
To use the Azure Cognitive Search functionality, a Search instance must be provisioned with Semantic Search enabled.
Terminology used in this Playbook
Node - a single infrastructure VM comprised of compute and memory
Cluster - a group of nodes
Spark Pool - a cluster with its associated configuration and sizing
Clustering - an unsupervised machine learning technique for grouping similar records together
ADLS - Azure Data Lake Storage
Refer to the Text Clustering section for more detailed information on clustering documents.
The following code accelerators serve as starting points for trying approaches that are known to work in the data discovery phase. Note: these accelerators are not intended for production and will require amendment before they can be incorporated into a production pipeline.
| Media Type | Scenario | Description | Platform |
|---|---|---|---|
| Text Documents | Text Clustering | Extract features with TF-IDF and cluster documents with built in Search and interactive PowerBI report | Synapse |
| Text Documents | Text Clustering | Extract features with spaCy and cluster documents with built in Search and interactive PowerBI report | Synapse |
| Text Documents | Text Clustering | Extract features with BERT and cluster documents with built in Search and interactive PowerBI report | Synapse |
| Text Documents | Text Clustering | Extract features with Azure OpenAI and cluster documents with built in Search and interactive PowerBI report | Synapse |
| Text Documents | Text Summarisation | Generate abstractive text summaries with Pegasus xsum model with built in Search | Synapse |
| Images and videos | Image Clustering | Extract features from images, make an imagenet prediction and cluster | Synapse |
| Images and videos | Image Captioning | Generate a caption for an image and cluster the captions with built in Search and interactive PowerBI report | Synapse |

See Environment preparation for Synapse
Example walkthroughs with data
This section contains some documented common scenarios:
Azure Services used in this repository
Azure Synapse Analytics is a limitless analytics service that brings together data integration, enterprise data warehousing and big data analytics. It gives you the freedom to query data on your terms, using either serverless or dedicated options, at scale. Azure Synapse brings these worlds together with a unified experience to ingest, explore, prepare, transform, manage and serve data for immediate BI and machine learning needs.
PowerBI. Connect to and visualize any data using the unified, scalable platform for self-service and enterprise business intelligence (BI) that's easy to use and helps you gain deeper data insight.
Azure Cognitive Search is a fully managed search as a service to reduce complexity and scale easily including:
Graphframes is a package for Apache Spark which provides DataFrame-based Graphs. It provides high-level APIs in Scala, Java, and Python. It aims to provide both the functionality of GraphX and extended functionality taking advantage of Spark DataFrames. This extended functionality includes motif finding, DataFrame-based serialization, and highly expressive graph queries.
The notebooks contain a basic graph implementation that can be amended to run functions such as BFS, DFS, find communities and label propagation amongst others.
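To make the kind of graph function mentioned above concrete, here is a small breadth-first search (BFS) in plain Python. It is a sketch of the technique only, not the notebooks' GraphFrames code, which runs on Spark DataFrames. The function name `bfs_path` and the document ids are hypothetical, and edges are assumed to be undirected.

```python
from collections import deque


def bfs_path(edges, start, goal):
    """Return the shortest path from start to goal as a list of nodes, or None if unreachable."""
    # Build an undirected adjacency map from the edge list.
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk back through the predecessors to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in sorted(graph.get(node, ())):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


# Hypothetical edges, e.g. documents linked because they share key terms.
edges = [("doc1", "doc2"), ("doc2", "doc3"), ("doc4", "doc5")]
print(bfs_path(edges, "doc1", "doc3"))
print(bfs_path(edges, "doc1", "doc5"))
```

The same question asked of a large dataset is where a distributed graph library becomes useful, since the adjacency map here has to fit in memory on a single machine.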
Datasets used in this repository 💾

| Dataset | Description | Labels |
|---|---|---|
| BBC sports | Consists of 737 documents from the BBC Sport website corresponding to sports news articles in five topical areas from 2004-2005 | Class Labels: 5 (athletics, cricket, football, rugby, tennis) |

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.
Please refer to the following references for additional relevant material:
Azure Cognitive Search additional links