Technical Concepts
Terms and definitions important for technical PMs. Definitions courtesy of TritonGPT.
SaaS, PaaS, -aaS
SaaS stands for Software as a Service, which refers to software applications delivered over the internet. Instead of installing and running software on their own computers, users can access software applications via the internet, typically through a web browser. Examples of SaaS include Microsoft Office 365, Salesforce, and Google Workspace.
PaaS stands for Platform as a Service, which provides a complete platform for developing, running, and managing software applications. It includes the operating system, middleware, and development tools, and allows developers to focus on writing code without worrying about the underlying infrastructure. Examples of PaaS include AWS Elastic Beanstalk, Google App Engine, and Azure App Service.
-aaS refers to anything delivered "as a service" over the internet, such as Infrastructure as a Service (IaaS), Desktop as a Service (DaaS), and Security as a Service (SECaaS).
Integrations vs Middleware
Middleware refers to software that connects and integrates different applications, systems, and services. It acts as a bridge between two or more systems, enabling them to communicate and exchange data. Middleware can serve various purposes, such as data integration, application integration, and security. Examples of middleware include Apache Kafka, MuleSoft, and Microsoft Azure Logic Apps.
Integrations, on the other hand, refer to the connections themselves: the work of linking different systems, applications, and services so they can exchange data. Integrations can be built with middleware, APIs, or other integration technologies, and they enable data exchange, workflow automation, and other interactions between systems.
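The middleware idea above can be sketched in a few lines of Python: a translation layer that lets two systems with different data formats exchange records. The systems and field names below are hypothetical, purely for illustration.

```python
def hr_system_record():
    """A record as a (hypothetical) HR system emits it."""
    return {"emp_id": "A123", "full_name": "Ada Lovelace", "dept": "ENG"}

def to_payroll_format(record):
    """Middleware step: translate the HR record into the shape
    a (hypothetical) payroll system expects."""
    return {
        "employeeId": record["emp_id"],
        "name": record["full_name"],
        "departmentCode": record["dept"],
    }

# Neither system knows about the other; the translation layer bridges them.
payroll_record = to_payroll_format(hr_system_record())
```

Real middleware adds transport, queuing, retries, and monitoring on top, but the core job is this kind of mediation between formats and protocols.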
iPaaS tools at UCSD:
NiFi is a data integration tool that automates the movement and transformation of data between systems. It is designed to ingest and route large volumes of data with low latency, making it a suitable choice for real-time, event-driven dataflows as well as batch-style transfers.
In comparison, Kafka is a distributed streaming platform that is designed to handle high-throughput and provides low-latency, fault-tolerant, and scalable data processing. It is often used for real-time data streaming and event-driven architecture, but it may not be the best choice for batch transfers.
Batch transfers, on the other hand, refer to the process of transferring large amounts of data in batches rather than in real-time. This approach is often used for tasks such as data warehousing, data migration, and data synchronization. While NiFi can handle batch transfers, Kafka is not designed for batch processing and is better suited for real-time data streaming.
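The streaming-versus-batch distinction can be sketched in plain Python (no NiFi or Kafka required): a batch transfer materializes the whole dataset before handing it off, while stream processing handles each event as it arrives.

```python
def batch_transfer(records):
    """Batch: collect the entire dataset, then hand it off in one operation."""
    return list(records)  # the whole dataset is materialized at once

def stream_process(records, handler):
    """Streaming: handle each record as it arrives, without buffering all of it."""
    for record in records:
        handler(record)

events = ({"id": i} for i in range(3))  # simulated event source
seen = []
stream_process(events, seen.append)    # each event handled immediately
```

The trade-off mirrors the tools above: streaming keeps latency low per event, while batch transfers amortize overhead across large volumes, which is why warehousing and migration jobs tend to run in batches.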
API Manager is a tool that helps manage, retrieve, and rotate Application Programming Interface (API) credentials throughout their lifecycles. It is a feature of the ServiceNow platform, which is a cloud-based IT service management software used by UC San Diego. API Manager helps improve security posture by avoiding hard-coded credentials in application source code. Instead, it stores credentials securely and retrieves them dynamically when needed.
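The pattern API Manager supports (keeping credentials out of application source code) can be sketched as follows. The variable name SERVICE_API_KEY is a made-up example, and a real deployment would fetch the value from a secrets store at runtime rather than from an environment variable.

```python
import os

def get_api_key():
    """Fetch the credential at runtime instead of hard-coding it.
    SERVICE_API_KEY is a hypothetical name used for illustration."""
    key = os.environ.get("SERVICE_API_KEY")
    if key is None:
        raise RuntimeError("SERVICE_API_KEY is not set; refusing to run")
    return key
```

The point is that the source code never contains the secret itself, so rotating the credential requires no code change or redeploy.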
Airflow is a free and open-source platform used to programmatically schedule and monitor workflows. It was originally developed at Airbnb and is now maintained by Apache. Airflow allows users to create DAGs (directed acyclic graphs) to model workflows, which can include tasks such as data processing, data transformation, and data loading.
Airflow provides a web-based interface for creating, scheduling, and monitoring DAGs, as well as a command-line interface for creating and managing DAGs programmatically. DAGs in Airflow can be created using Python code, which allows for flexibility and customization.
Airflow is designed to handle complex workflows and can handle failures and retries, making it a reliable choice for data processing and machine learning workflows. It also provides features such as dependency management, variable substitution, and sensor management, which make it easy to create and manage complex workflows.
Some of the key features of Airflow include:
DAGs: Airflow allows users to create DAGs, which are directed acyclic graphs that represent workflows. DAGs can include tasks such as data processing, data transformation, and data loading.
Tasks: Airflow provides a variety of operators that perform specific operations, such as PythonOperator, BashOperator, and SQLExecuteQueryOperator.
Triggers: Airflow DAGs can be triggered on a schedule, manually, or by external events, such as an API call from another system.
Sensors: Airflow provides sensors, special operators that wait for an external condition to be met, such as a file appearing in a directory or a row landing in a database table, before downstream tasks run.
Variables: Airflow allows users to define key-value variables that can be referenced within DAGs for configuration; passing data between tasks at runtime is handled by a separate mechanism called XComs.
Dependency management: Airflow allows users to define dependencies between tasks, ensuring that tasks are executed in the correct order.
Retry mechanism: Airflow provides a retry mechanism that allows users to specify the number of times a task should be retried if it fails.
Web-based interface: Airflow provides a web-based interface for creating, scheduling, and monitoring DAGs, making it easy to manage workflows.
Command-line interface: Airflow also provides a command-line interface for creating and managing DAGs programmatically.
Airflow is widely used in industry and academia for a variety of use cases, including data processing, machine learning, and data engineering. It is particularly useful for workflows that involve multiple tasks and dependencies, and for workflows that need to be executed repeatedly.
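Airflow DAGs are themselves written in Python; the underlying idea, tasks ordered so each one runs after its upstream dependencies, can be sketched with only the standard library. The pipeline below is a hypothetical example, not an actual Airflow DAG.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of upstream tasks it depends on
# (a directed acyclic graph, like an Airflow DAG).
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

# Resolve the dependencies into a valid execution order.
order = list(TopologicalSorter(dag).static_order())
```

Airflow layers scheduling, retries, and monitoring on top of this dependency resolution, but a DAG run is at heart a topological ordering of tasks like the one computed here.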
SDLC
SDLC stands for Software Development Life Cycle, which refers to the process of developing, testing, and deploying software. It includes various stages, such as planning, requirements gathering, design, development, testing, deployment, and maintenance. SDLC methodologies can vary, such as Agile, Waterfall, and DevOps.
CI/CD stands for Continuous Integration and Continuous Delivery (or Deployment), which refers to automating software development processes such as building, testing, and deployment. CI/CD tools include Jenkins, GitLab CI/CD, and CircleCI.
Bamboo is a CI/CD tool by Atlassian that automates software development processes, such as building, testing, and deployment.
GitHub is a web-based platform, built around Git, for version control and collaboration on software development projects.
Artifactory is a universal artifact repository manager that helps manage software artifacts and dependencies across various platforms and technologies.
CMDB stands for Configuration Management Database, which refers to a database that stores information about IT assets, configurations, and their relationships. It helps organizations manage and track changes in their IT infrastructure.
CAB stands for Change Advisory Board, a group that reviews, assesses, and approves proposed changes to IT systems and services before they are implemented.
Security
Shibboleth is an open-source web single sign-on (SSO) system that enables users to access multiple web applications with a single set of login credentials.
AD stands for Active Directory, which is a directory service developed by Microsoft that provides centralized management of networked resources.
Certificate management is the process of issuing, deploying, renewing, and revoking the digital (TLS/SSL) certificates that secure communications between systems.
DNS stands for Domain Name System, the service that translates human-readable domain names (such as ucsd.edu) into IP addresses.
Duo is a multi-factor authentication (MFA) service that verifies a user's identity with a second factor, such as a push notification to a phone, in addition to a password.
Monitoring
Datadog is a monitoring and analytics platform that helps organizations monitor their applications, infrastructure, and logs.
Splunk is a platform for searching, monitoring, and analyzing machine-generated data, such as logs, metrics, and events.
Nagios is an open-source monitoring system that watches hosts, services, and networks and alerts administrators when problems occur.
Data
Database: A database is a collection of organized data that is stored in a way that allows for efficient retrieval and manipulation.
Hub vs data warehouse vs data lake:
Hub: A data hub is a central point of mediation and exchange where data from multiple sources is collected, harmonized, and shared with downstream consumers.
Data warehouse: A data warehouse is a large, centralized repository of cleaned, structured data from various sources, typically used for reporting and business intelligence.
Data lake: A data lake is a repository that stores raw, unprocessed data in its native format, allowing for flexibility in data analysis and processing.
Stored proc: A stored procedure is a precompiled database program that performs a specific task, such as data retrieval or manipulation.
Views: A view is a virtual table based on a SELECT statement that provides a simplified interface for accessing data in a database.
Column group: A column group is a way of grouping columns in a database table to facilitate data analysis and reporting.
GROUP BY clause: A SQL clause that groups rows sharing the same values in specified columns, so that aggregate functions such as SUM, COUNT, and AVG can summarize each group.
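A view and a GROUP BY aggregation can both be demonstrated with Python's built-in sqlite3 module (SQLite supports views, though not stored procedures). The table and column names below are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES ('alice', 10.0), ('bob', 5.0), ('alice', 7.5);

    -- A view: a named SELECT statement that behaves like a virtual table.
    CREATE VIEW customer_totals AS
        SELECT customer, SUM(amount) AS total, COUNT(*) AS n_orders
        FROM orders
        GROUP BY customer;   -- one output row per distinct customer
""")

rows = conn.execute(
    "SELECT customer, total, n_orders FROM customer_totals ORDER BY customer"
).fetchall()
```

Queries against the view stay simple because the grouping and aggregation logic lives in one place instead of being repeated in every report.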
Infrastructure and cloud
Infrastructure:
Definition: The underlying systems and structures that support a business or organization's operations, such as computing, storage, networking, and security.
Types:
Hardware infrastructure: The physical components of a computer system, such as servers, routers, and switches.
Software infrastructure: The programs and operating systems that manage and control hardware infrastructure, such as databases, middleware, and operating systems.
Network infrastructure: The communication systems and protocols that connect devices and allow them to exchange data, such as the internet, Wi-Fi, and Bluetooth.
Importance: Infrastructure is critical to the success of any business or organization, as it provides the foundation for all operations and systems. Without proper infrastructure, businesses may experience downtime, data loss, and other issues that can impact productivity and revenue.
Cloud:
Definition: A model of delivering computing services over the internet, where resources such as servers, storage, and applications are provided and managed by a third-party provider.
Types:
Public cloud: A cloud environment that is owned and operated by a third-party provider, such as Amazon Web Services (AWS), Microsoft Azure, or Google Cloud Platform (GCP).
Private cloud: A cloud environment that is owned and operated by a single organization, typically for its own use.
Hybrid cloud: A combination of public and private cloud environments, where an organization uses both to meet its computing needs.
Benefits:
Scalability: Cloud environments can quickly scale up or down to meet changing business needs, without the need for expensive hardware upgrades.
Cost-effectiveness: Cloud environments can reduce costs by eliminating the need for hardware maintenance, upgrades, and replacements.
Flexibility: Cloud environments offer a range of services, such as infrastructure as a service (IaaS), platform as a service (PaaS), and software as a service (SaaS), allowing organizations to choose the best fit for their needs.
In summary, infrastructure refers to the underlying systems and structures that support business operations, while cloud refers to a model of delivering computing services over the internet. Both are important for businesses to operate efficiently and effectively, and organizations may choose to use a combination of both to meet their computing needs.