Binary Large Object (BLOB)

Ultimately, all files are stored as binary data (1’s and 0’s), but in human-readable formats such as delimited text and JSON, the bytes of binary data are mapped to printable characters (typically through a character encoding scheme such as ASCII or Unicode). Some file formats, however, particularly those for unstructured data, store the data as raw binary that must be interpreted by applications and rendered. Common types of data stored as binary include images, video, audio, and application-specific documents.

When working with data like this, data professionals often refer to the data files as BLOBs (Binary Large Objects).

Optimized file formats

While human-readable formats for structured and semi-structured data can be useful, they’re typically not optimized for storage space or processing. Over time, specialized file formats have been developed that enable compression, indexing, and efficient storage and processing.

Some common optimized file formats you might see include Avro, ORC, and Parquet:

  • Avro is a row-based format. It was created by Apache. Each record contains a header that describes the structure of the data in the record. This header is stored as JSON. The data is stored as binary information. An application uses the information in the header to parse the binary data and extract the fields it contains. Avro is a good format for compressing data and minimizing storage and network bandwidth requirements.
  • ORC (Optimized Row Columnar format) organizes data into columns rather than rows. It was developed by Hortonworks for optimizing read and write operations in Apache Hive (Hive is a data warehouse system that supports fast data summarization and querying over large datasets). An ORC file contains stripes of data. Each stripe holds the data for a column or set of columns. A stripe contains an index into the rows in the stripe, the data for each row, and a footer that holds statistical information (count, sum, max, min, and so on) for each column.
  • Parquet is another columnar data format. It was created by Cloudera and Twitter. A Parquet file contains row groups. Data for each column is stored together in the same row group. Each row group contains one or more chunks of data. A Parquet file includes metadata that describes the set of rows found in each chunk. An application can use this metadata to quickly locate the correct chunk for a given set of rows, and retrieve the data in the specified columns for these rows. Parquet specializes in storing and processing nested data types efficiently. It supports very efficient compression and encoding schemes. (A brief read/write sketch in Python follows this list.)
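As a brief illustration of the columnar layout, the following is a minimal sketch assuming the Python pyarrow library (the column names and values are invented for illustration):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Build a small columnar table in memory.
    table = pa.table({
        "customer_id": [1, 2, 3],
        "name": ["Joe Jones", "Samir Nadoy", "Alice Smith"],
        "total_spend": [99.90, 55.40, 12.00],
    })

    # Write the table to a Parquet file; Parquet applies efficient
    # columnar compression and encoding as it writes.
    pq.write_table(table, "customers.parquet")

    # Read back only the columns we need - the file metadata lets the
    # reader skip any row groups and columns that weren't requested.
    subset = pq.read_table("customers.parquet", columns=["customer_id", "total_spend"])
    print(subset.to_pydict())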

Explore file storage

The ability to store data in files is a core element of any computing system. Files can be stored in local file systems on the hard disk of your personal computer, and on removable media such as USB drives; but in most organizations, important data files are stored centrally in some kind of shared file storage system. Increasingly, that central storage location is hosted in the cloud, enabling cost-effective, secure, and reliable storage for large volumes of data.

The specific file format used to store data depends on a number of factors, including:

  • The type of data being stored (structured, semi-structured, or unstructured).
  • The applications and services that will need to read, write, and process the data.
  • The need for the data files to be readable by humans, or optimized for efficient storage and processing.

Some common file formats are discussed below.

Delimited text files

Data is often stored in plain text format with specific field delimiters and row terminators. The most common format for delimited data is comma-separated values (CSV) in which fields are separated by commas, and rows are terminated by a carriage return / new line. Optionally, the first line may include the field names. Other common formats include tab-separated values (TSV), space-delimited values (in which tabs or spaces are used to separate fields), and fixed-width data in which each field is allocated a fixed number of characters. Delimited text is a good choice for structured data that needs to be accessed by a wide range of applications and services in a human-readable format.
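As a brief sketch, the following Python snippet uses the standard library’s csv module to parse a small comma-separated sample in which the first line holds the field names (the values are invented for illustration):

    import csv

    # A small delimited-text sample; the first row holds the field names.
    rows = [
        "customer_id,name,email",
        "1,Joe Jones,joe@litware.com",
        "2,Samir Nadoy,samir@northwind.com",
    ]

    # csv.DictReader uses the header row to map each value to its field name.
    for record in csv.DictReader(rows):
        print(record["customer_id"], record["name"], record["email"])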

Identify data formats

Data is a collection of facts such as numbers, descriptions, and observations used to record information. Data structures in which this data is organized often represent entities that are important to an organization (such as customers, products, sales orders, and so on). Each entity typically has one or more attributes, or characteristics (for example, a customer might have a name, an address, a phone number, and so on).

You can classify data as structured, semi-structured, or unstructured.

Structured data

Structured data is data that adheres to a fixed schema, so all of the data has the same fields or properties. Most commonly, the schema for structured data entities is tabular – in other words, the data is represented in one or more tables that consist of rows to represent each instance of a data entity, and columns to represent attributes of the entity. For example, Customer and Product entities might each be represented as a table, as in the illustrative sketch below.
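A minimal sketch of such tables (the field names and values are invented for illustration):

    Customer
    ID   Name          Email
    1    Joe Jones     joe@litware.com
    2    Samir Nadoy   samir@northwind.com

    Product
    ID    Name            Price
    123   Hiking boots    99.99
    124   Running shoes   65.50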

Structured data is often stored in a database in which multiple tables can reference one another by using key values in a relational model, which we’ll explore in more depth later.

Semi-structured data

Semi-structured data is information that has some structure, but which allows for some variation between entity instances. For example, while most customers may have an email address, some might have multiple email addresses, and some might have none at all.

One common format for semi-structured data is JavaScript Object Notation (JSON). The example below shows a pair of JSON documents that represent customer information. Each customer document includes address and contact information, but the specific fields vary between customers.
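A representative pair of documents might look like the following (the field names and values are purely illustrative):

    Document 1:

    {
      "firstName": "Joe",
      "lastName": "Jones",
      "address": {
        "streetAddress": "1 Main St.",
        "city": "New York",
        "state": "NY",
        "postalCode": "10099"
      },
      "contact": [
        { "type": "home phone", "number": "555 123-1234" },
        { "type": "email", "address": "joe@litware.com" }
      ]
    }

    Document 2:

    {
      "firstName": "Samir",
      "lastName": "Nadoy",
      "address": {
        "streetAddress": "123 Elm Pl.",
        "unit": "500",
        "city": "Seattle",
        "state": "WA",
        "postalCode": "98999"
      },
      "contact": [
        { "type": "email", "address": "samir@northwind.com" }
      ]
    }

Notice that both documents describe a customer’s address and contact details, but the first includes a phone number and the second includes a unit field – the structure varies between instances.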

Explore the power of autonomous development assistance

GitHub Copilot Agent Mode significantly enhances traditional AI-assisted coding by autonomously handling complex, multi-step tasks and continuously iterating on its solutions. Understanding this capability allows developers to streamline workflows, optimize productivity, and effectively balance automation with human oversight.

Autonomous operation

Copilot Agent Mode independently analyzes coding requests, dynamically identifies relevant files, determines appropriate terminal commands, and implements comprehensive solutions without explicit step-by-step instructions.

Example

Task: Create a new REST API endpoint.

Agent Mode autonomously:

  • Creates API routes (routes/api.js)
  • Updates main application (app.js)
  • Installs necessary dependencies (npm install express)
  • Generates test cases (tests/api.test.js)

Although highly autonomous, Agent Mode provides developers with complete transparency and control over each proposed change.

Handling complex, multi-step tasks

Going beyond simple code suggestions, Agent Mode excels in breaking down complex tasks into structured, sequential actions. This capability significantly reduces manual workload and speeds up complex project operations.

Human-AI interaction and global implications

As technology advances, human and AI interaction grows more important. AI isn’t just for automation; it’s transforming industries, improving our lives, and sparking innovation. This video examines AI’s societal impact and the key considerations for its integration.

AI is changing industries by enabling data-driven decisions, automating processes, and fostering innovation. By following these guidelines, we can harness the power of AI responsibly and shape a future that aligns with our shared values and aspirations.

Data privacy: Balance the need for data with the protection of individual privacy rights.

Algorithmic bias: Detect and mitigate bias in AI systems so that societal biases are not reflected in their outputs.

Transparency: Ensure AI decision-making processes are clear and understandable to foster trust and accountability.

Legal liability: Address responsibility for AI decisions, considering the roles of developers, users, and the AI itself.

Innovation and accountability: Strike a balance between innovation and accountability to ensure responsible AI usage.

Data sharing: Encourage data sharing while safeguarding privacy to improve AI systems without compromising individual rights.

AI research: Invest in AI research to fuel innovation and ensure the benefits of AI are accessible to all.

Digital education: Promote digital education and workforce development to equip people with the skills needed in an AI-transformed job market.

AI advisory committees: Establish AI advisory committees to provide oversight, insights, and guidance on AI development and deployment.

Government engagement: Engage with government officials to shape policies that affect AI use in communities.

These guidelines are designed to ensure the responsible and ethical use of AI, fostering a positive and fair influence on society.

Deepfakes and copyright in AI

Deepfake technology has added complexity to the rapid production and distribution of digital content. AI-generated deepfakes, which can mimic real people, pose serious ethical and legal issues, especially concerning copyright and intellectual property.

This video investigates deepfakes and their impacts, highlighting strategies to protect creators’ rights and verify digital content. It addresses the challenges of deepfakes, the industry’s fight against misinformation, and emerging legal frameworks.

Principles of responsible AI

It is essential to develop AI systems based on trustworthy, fair, and privacy-conscious principles. This video outlines six key principles: accountability, inclusiveness, reliability & safety, fairness, transparency, and privacy & security. Discover how these principles foster trust and help create AI systems that respect individual rights and serve society. The journey to responsible AI begins with trust, a trust that is built on these six principles:

Accountability: Define clear roles and responsibilities for AI impacts.

Inclusiveness: Ensure AI benefits everyone and is accessible to all.

Reliability & Safety: Test and validate extensively, monitor continuously, and build in safety protocols with robust error handling.

Fairness: Treat all individuals equitably, with regular assessments to prevent bias.

Transparency: Enable users to comprehend AI decisions in order to build trust.

Privacy & Security: Safeguard user data, collect only necessary information, and implement strong security protocols.

These principles aim to develop AI systems that benefit society while upholding individual rights and values.

Using AI responsibly: Best practices

Responsible collaboration with AI is essential. Here are some key best practices:

Understand AI: It’s important to grasp the basics of AI and its broad capabilities. This knowledge is the foundation for using AI effectively.

Stay informed: Keep up with the latest advancements and ethical discussions in AI. This will help you leverage AI responsibly.

Recognize AI’s blind spots: AI can reflect societal biases present in the data it learns from. Actively seek unbiased information and understand how AI uses data to navigate these blind spots.

Prioritize safety and privacy: Your data is precious. Choose AI services that value user privacy and prioritize security and transparency.

Cross-verify AI-generated content: Don’t accept AI-generated content at face value. Always cross-verify information from various sources and engage your critical thinking skills.

Evaluate and refine content: Use critical thinking to evaluate and refine AI-generated content by verifying facts and sources, understanding the content’s goals and target audience, and considering a range of viewpoints.

Ensure clear policies: Make sure the AI tool or service you’re using has clear policies and guidelines for secure usage.

Promote AI for good: AI should be a tool for good, aiding in areas like healthcare, education, and environmental conservation.

Join the conversation: Start discussions in your community and workplace about responsible AI use. Encourage people to think about how AI will be utilized and take steps to prevent misuse.

Describe factors that can affect costs in Azure

Azure shifts development costs from the capital expense (CapEx) of building out and maintaining infrastructure and facilities to an operational expense (OpEx) of renting infrastructure as you need it, whether it’s compute, storage, networking, and so on.

That OpEx cost can be impacted by many factors. Some of the impacting factors are:

  • Resource type
  • Consumption
  • Maintenance
  • Geography
  • Subscription type
  • Azure Marketplace

Resource type

A number of factors influence the cost of Azure resources. The type of resource, the settings for the resource, and the Azure region will all have an impact on how much a resource costs. When you provision an Azure resource, Azure creates metered instances for that resource. The meters track the resource’s usage and generate a usage record that is used to calculate your bill.

Text to speech

The text to speech API enables you to convert text input to audible speech, which can either be played directly through a computer speaker or written to an audio file.

Speech synthesis voices: When you use the text to speech API, you can specify the voice to be used to vocalize the text. This capability offers you the flexibility to personalize your speech synthesis solution and give it a specific character.

The service includes multiple pre-defined voices with support for multiple languages and regional pronunciation, including neural voices that leverage neural networks to overcome common limitations in speech synthesis with regard to intonation, resulting in a more natural-sounding voice. You can also develop custom voices and use them with the text to speech API.
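The following is a minimal sketch assuming the Azure AI Speech SDK for Python (the azure-cognitiveservices-speech package); the subscription key, region, voice name, and output file name are placeholders:

    import azure.cognitiveservices.speech as speechsdk

    # Placeholder key and region for an Azure AI Speech resource.
    speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")

    # Choose one of the prebuilt neural voices to give the output a specific character.
    speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"

    # Write the synthesized speech to a WAV file instead of the default speaker.
    audio_config = speechsdk.audio.AudioOutputConfig(filename="greeting.wav")

    synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)
    result = synthesizer.speak_text_async("Welcome to the text to speech service.").get()

    if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
        print("Audio written to greeting.wav")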