Stack Monitoring

Oracle Cloud Infrastructure (OCI) Stack Monitoring provides essential monitoring and alarm management for applications and infrastructure, such as hosts, GPUs, databases, and app servers.

  • Monitor apps and infrastructure in one place

    Eliminate silos and get full-stack visibility into the health of your Oracle E-Business Suite, PeopleSoft, and GoldenGate systems as well as infrastructure, such as hosts, GPUs, databases, and app servers running on-premises or in the cloud. Extend monitoring to any infrastructure with Prometheus, collectd, or Telegraf integration. Monitor any unique condition with custom metrics.

  • Manage alarms at scale

    Use Monitoring Templates to manage all alarm conditions for your E-Business Suite or PeopleSoft applications, database systems, app servers, and fleet of hosts, including GPU infrastructure. Alarm settings are automatically applied as your environment grows. During patching windows, you can suppress alarms easily using topology-aware Maintenance Windows.

  • Monitor GPU infrastructure and workloads

    With turnkey monitoring of GPU infrastructure, use the Enterprise Health and Alarms view to interactively monitor across the GPU fleet. Triage open alarms, assess activity across all GPUs, monitor GPU utilization, track GPU temperatures, and identify underutilized GPUs and errors. Monitor workload processes and correlate with underlying infrastructure.

How Stack Monitoring works

OCI Stack Monitoring provides full-stack monitoring and alarm management of applications and infrastructure resources running on-premises or in the cloud. Stack Monitoring’s app-specific logic is bundled as a plugin to the agent. The agent uses this plugin to discover resources and collect metrics, which are then sent to OCI. Stack Monitoring creates resources and application topologies from the discovery results. Status and key performance data are shown in the Enterprise Health and Alarms user interface (UI), resource homepages, and fleet and application dashboards. Alarm creation is streamlined using Monitoring Templates, which create alarms in OCI Monitoring; these alarms are automatically shown and summarized in the Stack Monitoring UI. Using machine learning, baselines for key performance metrics are automatically calculated and anomalies are highlighted in performance charts.

Stack Monitoring use cases

  • Monitor Oracle Applications, including E-Business Suite and PeopleSoft

    Discover and monitor all components of your E-Business Suite application deployment, including Concurrent Manager, Workflow Background Engine, Notification Mailer, and the dependent WebLogic Servers, Oracle Databases, and hosts. Use the E-Business Suite homepage to check the status of all components and open alarms. With Stack View, you can quickly check vital signs across the stack, including E-Business Suite long-running programs, Concurrent Manager request status, WebLogic memory utilization and thread pool status, Oracle Database wait times, and host CPU and memory.


    Using similar workflows, you can discover and monitor PeopleSoft and its components, including Application Server Domain, PeopleSoft Internet Architecture (PIA), Process Scheduler, and the dependent WebLogic Server, Oracle Database, and hosts. Use the PeopleSoft homepage to check the status of all components and open alarms. Use the PeopleSoft Stack View to assess the status and load of application server domains, process scheduler domains, and PIA, as well as WebLogic resource usage and stuck threads, Oracle Database wait times, and host CPU and memory.



  • Monitor cloud and on-premises hosts

    Start monitoring cloud or on-premises hosts as soon as an agent is deployed on them or as soon as an OCI compute instance is provisioned. Monitor status, alarms, resource usage (CPU, memory, swap, and filesystem usage), and load (disk activity and paging activity) across all cloud and on-premises hosts in a single view. Investigate hosts with high CPU usage to determine which apps consume the most CPU. Use anomalies shown in performance charts to understand whether high resource usage is within expected baselines. If needed, monitor conditions specific to your environment using Metric Extensions.



  • Monitor databases and middleware

    Discover and monitor databases and middleware in one place. Monitor complete Oracle Database systems (including PDBs, Listener, Automatic Storage Management, and Cluster), GoldenGate, and SQL Server databases. Monitor middleware such as WebLogic Servers, Managed File Transfer, SOA, and Oracle HTTP Server, as well as Tomcat, Apache HTTP Server, JBoss, JVM servers, Oracle Identity Manager, and Oracle Unified Directory.


    Use Enterprise Health and Alarms to triage open alarms and understand slow performance, high resource usage, and errors across the database and middleware tiers. Use its interactive charts to dynamically correlate any two response and load metrics. Drill down on any performance metric to view historical trends and identify anomalies.



  • Add custom metrics

    Monitor conditions that are unique to your environment using Metric Extensions. Follow the Metric Extensions guided workflow to define the metric name and type, custom scripts, or SQL queries. Test the metric in an iterative manner; try out the metric on test resources, review the data, and edit the metric as needed. Once tested, publish and enable the Metric Extension on your resources. Monitor the data from Metric Extensions from any Stack Monitoring UI—homepages, Enterprise Health and Alarms, or dashboards. Enable anomaly detection to automatically learn baselines and identify anomalies in performance charts. Set up alarm rules on Metric Extensions to generate alarms when values cross performance thresholds.



  • Monitor GPU infrastructure

    Monitor the overall health of your GPU infrastructure fleet from a single Enterprise Health and Alarms view. Interact with this view to triage open alarms across hosts and GPUs, track CPU and memory utilization across all hosts, and assess GPU activity, memory utilization, power, temperature, and latency across all GPUs. Identify host availability issues or hotspots such as GPUs nearing maximum temperatures. Track errors and underutilized GPUs. Drill down from the enterprise view to a specific cluster network view. Continue troubleshooting using the built-in topology views to drill down from a cluster network to hosts and GPUs within network blocks or local blocks in the cluster network.



  • Manage alarms across the fleet

    Simplify alarm management for applications, systems, and infrastructure fleet using Monitoring Templates. Monitoring Templates provide a resource-centric way to define and manage all alarm conditions for an E-Business Suite or PeopleSoft application, database system, or a fleet of application servers and hosts. During patching periods, use Maintenance Windows to provide a resource-centric way to mute alarms for a fleet of hosts or app servers or for applications such as E-Business Suite and PeopleSoft.



Stack Monitoring capabilities

GPU infrastructure and workload monitoring

GPU infrastructure monitoring

Discovery of GPU infrastructure topology.

  • Discovers cluster network topology, including network blocks, local blocks, hosts, and GPUs.
  • Discovers GPUs associated with the hosts within the cluster network topology.
  • Provides built-in topology UIs to navigate across the cluster network topology.

Monitor GPU infrastructure health and workloads

Top-down enterprise health and alarms monitoring.

  • Use the Enterprise Health and Alarms host GPU view to monitor all GPU infrastructure across the fleet.
  • Status region identifies host availability issues.
  • Alarms region aggregates alarms across all hosts and GPUs with drill downs for additional triage.
  • Host performance charts aggregate CPU and memory utilization across all hosts and help identify outliers.
  • GPU performance charts aggregate performance across all GPUs, including activity, memory utilization, power consumption, temperature, latencies, and ECC errors. They help identify problem areas, such as high temperatures and errors, as well as unutilized GPUs that are available for additional workloads.
  • Interactive views drill down to historical data or to specific hosts or GPUs for additional troubleshooting.
  • Monitor workload processes and correlate performance with underlying hosts and GPUs.

Discovery of applications and application infrastructure

Simplified discovery

One-click discovery for applications such as Oracle E-Business Suite and PeopleSoft as well as application stack technologies.

  • Discovers all components of E-Business Suite, such as concurrent processing, workflow background engine, and notification mailer as well as the dependent WebLogic Servers.
  • Discovers all components of PeopleSoft, such as application server domain, PIA, process scheduler, and OpenSearch as well as the dependent WebLogic Servers.
  • Discovery support for databases and related resources, such as Oracle Database system (including PDBs, Listener, Automatic Storage Management, and Cluster), SQL Server database, and GoldenGate.
  • Discovery support for middleware such as WebLogic Servers, Managed File Transfer, SOA, Oracle HTTP Server, Tomcat, Apache HTTP Server, Oracle Identity Manager, Oracle Unified Directory, and more.
  • Autodiscovery and monitoring of on-premises hosts and compute with agent deployment.

Application topology

Automatic creation of application topology that associates applications with app servers and databases to enable troubleshooting of issues across the stack.

  • E-Business Suite application topology associates E-Business Suite with the dependent WebLogic Servers and Oracle Database.
  • PeopleSoft application topology associates PeopleSoft with the dependent WebLogic Servers and Oracle Database.
  • WebLogic domain topology associates its WebLogic clusters and WebLogic Servers.
  • Oracle Database systems topology associates Oracle Database, PDBs, Listeners, Cluster, and Automatic Storage Management.
  • GoldenGate topology associates components such as GoldenGate deployment, administration service, distribution service, and Extract and Replicat processes.
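
One way to picture an application topology is as a dependency graph that can be walked from the application down to its hosts. The following is a minimal sketch under that assumption; the resource names and the TOPOLOGY mapping are hypothetical and do not reflect how Stack Monitoring stores or exposes topology.

```python
from collections import deque

# Hypothetical resources; each key maps to the resources it depends on.
TOPOLOGY = {
    "ebs-prod": ["wls-server-1", "wls-server-2", "oracle-db-1"],
    "wls-server-1": ["host-a"],
    "wls-server-2": ["host-b"],
    "oracle-db-1": ["host-c"],
    "host-a": [],
    "host-b": [],
    "host-c": [],
}


def dependencies(topology, root):
    """Return every resource reachable from root, in breadth-first order."""
    seen = {root}
    order = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for dep in topology.get(node, []):
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    return order


print(dependencies(TOPOLOGY, "ebs-prod"))
```

Walking the graph this way is what makes cross-stack troubleshooting possible: a problem on the application can be traced to the app servers, database, and hosts beneath it.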

Monitoring of applications and infrastructure health

Curated monitoring

Each resource type is automatically monitored for key vital signs relating to its availability, response, load, error, and utilization, reducing DevOps’ burden of requiring domain expertise to determine what’s important to monitor.

  • E-Business Suite monitoring includes program running time as well as the status of Concurrent Manager and long-running concurrent requests.
  • PeopleSoft monitoring includes application server domain health and load, process scheduler domain health and load, PIA health and load, and Elasticsearch/OpenSearch query and fetch latencies.
  • WebLogic monitoring includes heap usage, stuck threads, web request rate, and web request processing time.
  • Oracle Database monitoring includes tablespace usage, blocking sessions, database time, FRA usage, and IO throughput.
  • Host monitoring includes CPU, memory, swap, and file system utilization.

Machine learning–based anomaly detection

Anomaly detection enables rapid problem identification and resolution.

  • Provides quick visual identification of resources performing outside historical norms.
  • Uses machine learning to automatically calculate baselines for key performance metrics.
  • Flags anomalous behavior and provides helpful charts and comparisons.
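
The idea of comparing current values against a learned baseline can be illustrated with a deliberately simple stand-in. The page does not describe the algorithm Stack Monitoring uses, so the sketch below substitutes a plain mean plus or minus k standard deviations band; the function names, the sample values, and the choice of k are all assumptions made for illustration.

```python
from statistics import mean, stdev


def baseline(history, k=3.0):
    """Return (lower, upper) bounds as mean +/- k sample standard deviations."""
    m = mean(history)
    s = stdev(history)
    return m - k * s, m + k * s


def anomalies(history, recent, k=3.0):
    """Return (index, value) pairs from recent that fall outside the baseline."""
    lower, upper = baseline(history, k)
    return [(i, v) for i, v in enumerate(recent) if v < lower or v > upper]


# Hypothetical CPU utilization samples, in percent.
history = [40, 42, 41, 39, 43, 40, 41, 42]
recent = [41, 44, 75, 40]
print(anomalies(history, recent))
```

A production baseline would typically also account for daily or weekly seasonality, which this sketch ignores.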

Alarm management at scale

Monitoring Templates provide a resource-oriented way to set alarm rule conditions for an application, system, or fleet of resources.

  • Use Oracle-certified Monitoring Templates for recommended alarm rules for E-Business Suite, PeopleSoft, Oracle Database, WebLogic Server, hosts, and other resource types.
  • Instead of managing individual metric alarm rules, Monitoring Templates provide a resource-oriented way to specify and manage a full set of alarm conditions and notifications for resources specified in the template.
  • OCI Monitoring alarm rules are automatically generated and updated based on the monitoring template.
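
The relationship between one template and many generated alarm rules can be sketched as a simple expansion. This is an illustration only: the Condition and AlarmRule classes, the metric names, and expand_template are hypothetical and are not part of the OCI SDK or the Stack Monitoring API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    metric: str       # hypothetical metric name
    operator: str
    threshold: float
    severity: str


@dataclass(frozen=True)
class AlarmRule:
    resource: str
    condition: Condition


def expand_template(conditions, resources):
    """Generate one alarm rule per (resource, condition) pair."""
    return [AlarmRule(r, c) for r in resources for c in conditions]


host_template = [
    Condition("CpuUtilization", ">", 90.0, "CRITICAL"),
    Condition("MemoryUtilization", ">", 85.0, "WARNING"),
]

rules = expand_template(host_template, ["host-a", "host-b"])
print(len(rules))

# When the fleet grows, re-expanding the same template covers the new host.
rules = expand_template(host_template, ["host-a", "host-b", "host-c"])
print(len(rules))
```

The design point is that the template, not the individual rule, is the unit that gets edited; rules are derived output.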

Maintenance windows

Maintenance windows provide a resource-oriented way to suppress alarms for resources undergoing maintenance operations.

  • Specify the resources, such as applications, database systems, or hosts, in the maintenance window and all associated alarms will be suppressed.
  • For topology-based applications, such as E-Business Suite or PeopleSoft, maintenance windows will automatically include all members. Hosts in maintenance will automatically include the resources running on the host.
  • One-time and recurring maintenance windows are supported.
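
Topology-aware suppression can be thought of as two checks: is the current time inside the window, and is the resource either selected directly or a member of something that was selected. The sketch below models a one-time window under those assumptions; the MEMBERS mapping, the resource names, and both functions are hypothetical and do not represent the service's implementation.

```python
from datetime import datetime

# Hypothetical membership: an application or host maps to what it contains or runs.
MEMBERS = {
    "ebs-prod": ["wls-server-1", "oracle-db-1"],
    "host-a": ["wls-server-1"],
}


def covered_resources(selected, members):
    """Expand the selected resources to include all of their members."""
    covered = set()
    stack = list(selected)
    while stack:
        resource = stack.pop()
        if resource not in covered:
            covered.add(resource)
            stack.extend(members.get(resource, []))
    return covered


def is_suppressed(resource, when, start, end, selected, members):
    """True if the resource's alarms fall inside a one-time maintenance window."""
    in_window = start <= when < end
    return in_window and resource in covered_resources(selected, members)


start = datetime(2025, 1, 4, 22, 0)
end = datetime(2025, 1, 5, 2, 0)
print(is_suppressed("oracle-db-1", datetime(2025, 1, 4, 23, 0), start, end, ["ebs-prod"], MEMBERS))
```

A recurring window would repeat the same time check on a schedule; that part is left out here.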

Specially curated UI for interactive troubleshooting

Single-pane-of-glass across on-premises and cloud

Use Enterprise Health and Alarms to get visibility across your enterprise and quickly identify outages, open alarms, and performance hot spots.

  • Status region identifies outages.
  • Status region by type enables assessment of the status of a full app stack or system, such as E-Business Suite, PeopleSoft, GoldenGate, or Oracle Database.
  • Alarms region summarizes alarms by severity with drill downs for further investigation.
  • Tier views for app servers, databases, and hosts identify resources with the slowest response and highest utilization.
  • Interactive charts support quick assessment of different metrics and drill downs to historical data.

Homepages for holistic monitoring

Each homepage gives access to a resource’s status, key performance metrics, alarms, and associated resources.

  • Check status of resource and its related components.
  • Triage any open alarms.
  • Correlate load and performance across time periods.
  • Watch out for pending performance issues through anomalies shown in performance charts.
  • Understand resource dependencies for holistic monitoring and use navigational topology for quick drill downs to dependent resources.

Curated application Stack Views

Stack Views provide rapid insight into the critical KPIs for the application and its underlying infrastructure stack.

  • E-Business Suite Stack View: Check the running times of the top E-Business Suite programs, verify the status of Concurrent Manager requests, and monitor WebLogic heap utilization, Oracle Database wait times, and host CPU and memory usage.
  • PeopleSoft Stack View: Check the health and load of the application server domain and verify that server processes are running. Review WebLogic JVM memory utilization and thread pool status, Oracle Database wait times, and host CPU and memory usage.

Dashboards

Unify metrics, traces, and logs across Observability and Management services using dashboards.

  • Use out-of-box dashboards for host, E-Business Suite, PeopleSoft, and Oracle Unified Directory to monitor a fleet of infrastructure and applications.
  • Clone any out-of-box dashboard and extend it to include traces and logs from other Observability and Management services.

Extend and customize monitoring

Metric Extensions

Add custom metrics to monitor conditions unique to your environment.

  • UI-based workflow guides you through the process of creating metric definitions, testing, publishing, and enabling them on your resources.
  • Metric Extensions data automatically appear in resource homepages and can be added to Enterprise Health and Alarms views.
  • Include Metric Extensions in Monitoring Templates to generate alarms when values cross thresholds.
  • Enable anomaly detection on Metric Extensions to show performance anomalies in metric charts.
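
As an illustration of the kind of custom script a Metric Extension might wrap, the sketch below counts files waiting in a spool directory. The directory layout, the ".pending" suffix, and the "name=value" output line are assumptions made for this example; the output format Metric Extensions actually expects is not described on this page and should be taken from the product documentation.

```python
import os
import sys


def count_pending(directory, suffix=".pending"):
    """Count regular files in directory whose names end with suffix."""
    return sum(
        1
        for entry in os.scandir(directory)
        if entry.is_file() and entry.name.endswith(suffix)
    )


if __name__ == "__main__":
    # The directory to inspect is passed as the first argument (default: current directory).
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    # Assumed output format: a single "name=value" line.
    print(f"pending_files={count_pending(target)}")
```

Keeping the measurement in a small function makes it easy to try the script on a test resource and adjust it before publishing.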

Importing OCI Service instances

Extend your application topology by associating Stack Monitoring resources with other OCI service instances.

  • Import an OCI service’s metric data into Stack Monitoring to create a new resource instance for that service in Stack Monitoring. For example, you can import an OCI Load Balancer that serves a WebLogic cluster.
  • Associate the new OCI service resource with other resources to enrich your application topology and get unified monitoring visibility across all resources.

Integration with other data sources

Monitor any type of infrastructure through integration with Prometheus, Telegraf, collectd, and process-based custom resources.

  • Prometheus integration creates new resources out of any external source emitting Prometheus data.
  • Telegraf and collectd integration enables monitoring of a wide range of infrastructure and apps.
  • Monitor any app running on a host by identifying the processes that make up the app. It will be automatically monitored for status and CPU and memory utilization.
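
For context on what "emitting Prometheus data" looks like, Prometheus sources expose samples in a plain-text format of metric name, optional labels in braces, and a numeric value. The sketch below parses a simplified subset of that format; the queue_depth metric and its labels are hypothetical, escaped characters in label values and timestamps are not handled, and this is not how Stack Monitoring itself ingests the data.

```python
import re

# Simplified pattern: name, optional {labels}, then a numeric value.
LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return (name, labels, value) tuples, skipping comments and blank lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE.match(line)
        if match:
            labels = dict(LABEL.findall(match.group("labels") or ""))
            samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


SAMPLE = """\
# HELP queue_depth Messages waiting in the queue.
# TYPE queue_depth gauge
queue_depth{queue="orders"} 17
queue_depth{queue="billing"} 3
"""

for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

Each distinct label set, such as each queue above, is a natural candidate for a separate monitored resource or dimension.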
