The right AI is key to an effective IT Operations future.

What to ask when evaluating AI solutions for IT

InsightFinder

--

Capabilities

Every vendor’s AI seems the same. What is the best approach for using AI to detect anomalies?

  • Unsupervised machine learning: Look for unsupervised machine learning rather than supervised machine learning so you avoid hand-labeling machine data; hand-labeling does not scale (see the sketch after this list).
  • Multiple data inputs: Look for anomaly detection across multiple types of machine data: logs and metrics plus change events. Detecting the right anomalies requires analyzing metrics and logs simultaneously, and the best AI technology lets you manage all of that data and use it to pinpoint the root of the problem.
  • Dynamic parameter tuning: Avoid algorithms like DBSCAN that require significant parameter tuning to achieve decent accuracy and performance. Instead, seek platforms with automated parameter tuning and the option to adjust manually if necessary.
  • Self-learning: Expect self-learning behavior via active learning techniques so that user feedback and actions automatically improve model accuracy.
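
For a concrete sense of what "no hand-labeling" means in practice, here is a minimal sketch of unsupervised anomaly detection using scikit-learn's IsolationForest. The metric values and the contamination setting are invented for illustration, and no particular vendor's pipeline is implied.

```python
# Minimal sketch of unsupervised anomaly detection on unlabeled metrics.
# IsolationForest learns what "normal" looks like without hand-labeled data.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Illustrative metric matrix: rows are 5-minute samples, columns are metrics
# (e.g., cpu_util, mem_util, request_latency_ms) for one node.
normal = rng.normal(loc=[40.0, 60.0, 120.0], scale=[5.0, 8.0, 15.0], size=(2000, 3))
spikes = rng.normal(loc=[95.0, 92.0, 900.0], scale=[2.0, 3.0, 50.0], size=(10, 3))
X = np.vstack([normal, spikes])

# contamination is the expected anomaly fraction; a good platform would tune
# this kind of parameter automatically rather than asking an operator to set it.
model = IsolationForest(contamination=0.01, random_state=0).fit(X)
labels = model.predict(X)  # -1 = anomalous sample, 1 = normal
print(f"flagged {(labels == -1).sum()} of {len(X)} samples as anomalous")
```

The same idea extends to logs and change events once they are converted into numeric features, which is why multi-source ingestion matters as much as the algorithm itself.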

If we have an AIOps platform, do we need a separate log analytics tool, event correlation tool, anomaly detection tool, and analytics dashboard?

No. The right AIOps platform should automate the lifecycle of operational incidents by aggregating data, detecting anomalies, supporting incident investigation, predicting future incidents, visualizing business impact, recommending the root cause of incidents, and automating remediation tasks.

Do any systems predict and auto-remediate incidents, or do they just detect anomalies?

AIOps platforms should not only detect anomalies but also predict future incidents and automate root cause analysis so that you can take action before service is disrupted. Expect an average lead time of at least two hours before predicted incidents occur.
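
To make the lead-time expectation concrete, here is a small sketch of how lead time can be measured: the gap between when a prediction is raised and when the incident actually begins. The timestamps below are invented for illustration.

```python
# Sketch: average prediction lead time across a few (made-up) incidents.
from datetime import datetime, timedelta

predictions = [  # (time the prediction was raised, time the incident began)
    (datetime(2023, 5, 1, 8, 0),  datetime(2023, 5, 1, 10, 45)),
    (datetime(2023, 5, 2, 13, 30), datetime(2023, 5, 2, 16, 10)),
    (datetime(2023, 5, 3, 21, 15), datetime(2023, 5, 4, 0, 5)),
]

lead_times = [incident - predicted_at for predicted_at, incident in predictions]
avg_lead = sum(lead_times, timedelta()) / len(lead_times)

print(f"average lead time: {avg_lead}")                          # 2:45:00 here
print(f"meets two-hour target: {avg_lead >= timedelta(hours=2)}")
```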

What if we want to understand business impact beyond just detecting anomalies?

Expect your AIOps platform to support custom incident definitions with whitelisting, blacklisting, and feature extraction. Also, be sure you can run analytics on those custom-extracted features.
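
As a rough illustration, the sketch below shows what a custom incident definition with whitelisting, blacklisting, and feature extraction might look like. The structure, field names, and patterns are hypothetical and do not reflect any platform's actual configuration format.

```python
# Hypothetical custom incident definition: whitelist/blacklist patterns plus
# a regex rule that extracts a business-level feature from matching log lines.
import re

incident_definition = {
    "name": "checkout_payment_failures",
    "whitelist": [r"payment.*declined", r"gateway timeout"],   # lines that count
    "blacklist": [r"payment.*declined.*test_mode"],            # lines to ignore
    "features": {"order_value": r"order_value=(\d+\.\d+)"},    # fields to extract
}

def matches(line, definition):
    """Return extracted features if the log line matches the definition."""
    if any(re.search(p, line) for p in definition["blacklist"]):
        return None
    if not any(re.search(p, line) for p in definition["whitelist"]):
        return None
    features = {}
    for name, pattern in definition["features"].items():
        m = re.search(pattern, line)
        features[name] = m.group(1) if m else None
    return features

line = "ERROR payment card declined order_value=129.99 user=abc"
print(matches(line, incident_definition))   # {'order_value': '129.99'}
```

Analytics over custom-extracted features like order_value is what turns raw anomalies into a view of business impact.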

Requirements

How many metrics should an AIOps platform ingest concurrently?

At least 100,000, assuming a low-end requirement of about 5,000 nodes and 20 metrics per node monitored at five-minute intervals.
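
The arithmetic behind that figure, under the assumptions stated above:

```python
# Arithmetic behind the 100,000-metric figure: 5,000 nodes, 20 metrics per
# node, polled every five minutes.
nodes = 5_000
metrics_per_node = 20
interval_minutes = 5

concurrent_metrics = nodes * metrics_per_node                   # 100,000 streams
samples_per_day = concurrent_metrics * (24 * 60 // interval_minutes)

print(f"concurrent metric streams: {concurrent_metrics:,}")     # 100,000
print(f"data points per day:       {samples_per_day:,}")        # 28,800,000
```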

What volume of log data should an AIOps platform process?

At least 5GB per core per day with no performance degradation or data loss. It should also support adding server cores for elastic scaling.
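
A quick sizing sketch, assuming throughput scales linearly with cores; the environment-wide daily log volume is an invented example:

```python
# Rough sizing from the 5 GB-per-core-per-day figure. The 800 GB/day total
# is a hypothetical environment, and linear scaling is assumed.
gb_per_core_per_day = 5
daily_log_volume_gb = 800

cores_needed = -(-daily_log_volume_gb // gb_per_core_per_day)   # ceiling division
print(f"cores required for {daily_log_volume_gb} GB/day: {cores_needed}")  # 160
```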

What log compression level should be expected for storage?

At least a 90% lossless log compression ratio to reduce storage costs.
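
To see what that ratio means for storage, here is a quick calculation using an assumed 1 TB of raw logs per day and a 30-day retention window:

```python
# What a 90% lossless compression ratio means for storage. The raw volume
# and retention window are assumed values for illustration.
raw_tb_per_day = 1.0
compression_ratio = 0.90      # fraction of volume removed
retention_days = 30

stored_per_day = raw_tb_per_day * (1 - compression_ratio)       # 0.1 TB/day
print(f"stored per day:   {stored_per_day:.1f} TB")
print(f"30-day footprint: {stored_per_day * retention_days:.1f} TB vs "
      f"{raw_tb_per_day * retention_days:.0f} TB uncompressed")
```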

Are all AIOps platforms only available on-prem?

No. Modern AIOps platforms support SaaS and on-prem deployment modes with feature parity across both.

What architecture is required to scale AIOps systems?

A distributed architecture that allows lightweight metric and log worker nodes to be added, and that lets data-processing jobs be queued and shared across clustered nodes.
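
The snippet below is a toy illustration of that queued-worker pattern: jobs land on a shared queue and any available worker picks them up, so capacity grows simply by adding workers. A production platform would use a distributed message broker and real worker nodes rather than in-process threads.

```python
# Toy illustration of queued, shared data-processing jobs. Adding capacity
# means starting another consumer of the same queue.
import queue
import threading

jobs: queue.Queue = queue.Queue()

def worker(worker_id: int) -> None:
    while True:
        job = jobs.get()
        if job is None:          # sentinel: shut this worker down
            jobs.task_done()
            break
        print(f"worker-{worker_id} processing {job}")
        jobs.task_done()

# "Adding a worker node" here is just starting another consumer of the queue.
workers = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in workers:
    t.start()

for shard in ["metrics-shard-1", "metrics-shard-2", "logs-shard-1", "logs-shard-2"]:
    jobs.put(shard)

for _ in workers:                # one shutdown sentinel per worker
    jobs.put(None)
jobs.join()
for t in workers:
    t.join()
```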

Training

What processing power is required to train AI models?

Commodity CPUs are sufficient. There should be no need for GPUs or TPUs.

Should algorithms require tuning by data scientists?

No. Modern systems don’t require parameter tuning or threshold setting to achieve maximum accuracy.

How long should it take to train a model? Days?

Modern AIOps platforms should achieve high levels of accuracy in minutes, with no more than a week or two of data, and should improve continuously by ingesting live data streams. Smart systems also let you define maintenance windows and holiday schedules to account for known patterns. Be aware that today’s machine learning models shouldn’t need to be bootstrapped with large volumes of historical data.
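
As a rough sketch of that training-data hygiene, the snippet below keeps only the last two weeks of samples and drops anything that falls inside a declared maintenance window. The timestamps and the window itself are invented for illustration.

```python
# Sketch: restrict training data to a recent window and skip samples that
# fall inside known maintenance periods, so known-noisy intervals don't
# pollute the learned baseline.
from datetime import datetime, timedelta

now = datetime(2023, 6, 15, 12, 0)
training_horizon = timedelta(days=14)

maintenance_windows = [
    (datetime(2023, 6, 10, 5, 0), datetime(2023, 6, 10, 7, 0)),  # patch window
]

def usable(sample_time: datetime) -> bool:
    if sample_time < now - training_horizon:
        return False             # older than the two-week training horizon
    return not any(start <= sample_time <= end
                   for start, end in maintenance_windows)

# Illustrative sample timestamps: one every six hours over the last 20 days.
samples = [now - timedelta(hours=h) for h in range(0, 24 * 20, 6)]
training_set = [t for t in samples if usable(t)]
print(f"kept {len(training_set)} of {len(samples)} samples for training")
```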

--

InsightFinder

Harness the power of AI to predict and prevent IT incidents