The past, present & future of anomaly detection, root cause analysis & incident prediction

InsightFinder
24 min read · Jun 22, 2021

Dr. Helen Gu discusses how InsightFinder started, and why the future of IT Operations revolves around the most effective AI solution for any business.

This transcript is based on a podcast found here.

Dan Turchin:

Good morning, good afternoon, or good evening depending on when it is you’re listening. I am Dan Turchin, and it’s my pleasure to moderate today’s discussion with one of the foremost AI researchers, who also happens to be our CTO and my friend. But before we get started with Helen, I thought I would share a bit of background, starting with a somewhat philosophical question, why are we here?

Dan Turchin:

Well, in the case of AI for IT Operations, we're here because our tiny little world got very complex over the last, say, five-plus years. Infrastructure is much more complex, there are many more data sources to be monitored, and they often send conflicting messages, leading to a high volume of false positive alerts. New patterns are being introduced into our infrastructure every day, things like microservices and serverless architectures. What happened is that even while all of the infrastructure around us got really, really complex, our ability as humans has not changed. So without literally adding 10X or 1,000X the resources to manage and monitor these infrastructures, we're faced with one of two choices: we can either suffer greater downtime, always play catch-up with the business, and often miss SLAs and pay penalties, or we can look for a smarter way to manage infrastructure as it scales.

Dan Turchin:

Today, what we’re going to talk about is the latter of the two, the way to apply machine intelligence to essentially automate away a problem that was created in the first place by machines. Now, we’re going to talk a bit about the value of closed-loop change management, which has been an elusive goal for many of us in this space for say at least 15 years.

Dan Turchin:

Now, Helen and her teams have published over 80 academic papers, and they're responsible for innovative technologies used by leaders at Google, IBM, Dell, Credit Suisse, and others. Helen was recently given, she's too modest to talk about it herself, the prestigious 10-Year Award from the Symposium on Cloud Computing for a paper that she and her co-authors published back in 2011 about elastic resource scaling. Helen is a professor at NC State and she lectures frequently on topics related to anomaly detection and distributed systems.

Dan Turchin:

But before we hand it over to Helen, let's open the time capsule just a bit and roll back the clock about a decade. I remember in 2010 discussing this principle of what it means to achieve closed-loop change management and auto-remediation with the CIO of a large auto manufacturer, and this was in Detroit. We were talking about some of these same principles: what does it mean to wrap a layer of intelligence horizontally across all of the key components in the life cycle? I remember this like it was yesterday. Her name was Janice, and she looked me in the eye and confidently committed that she was going to automate IT operations across the life cycle by the end of the year. Bless her soul, we need starry-eyed visionaries like that.

Dan Turchin:

The fact is, Janice's vision hasn't changed. Thankfully, her ability to deliver on the promise of self-healing infrastructure has. That's a bit of what we'll be discussing with Helen today. It turns out that with the right AI, systems can actually self-heal, using pattern recognition to feed insights from anomaly detection through the rest of the life cycle: incident management, problem management, change and configuration management. It can happen today because we have the right AI. That vision was powerful in 2010, and it's just as powerful today. It was just about a decade ahead of its time.

Dan Turchin:

The right AI refers not just to the algorithms, but also to the process of ingesting, storing, and interpreting all that data. When I think back to Janice's vision from a decade ago, it's finally something that's commercially available. One of the things we're going to discuss today is: what's the right set of questions to even ask, to know where to start if you're early in your journey?

Dan Turchin:

Now, InsightFinder, we're proud to say, is built on innovations, many of which came from Helen's labs, spanning hundreds of person-years and multiple decades. We're going to talk a little bit about how that, quote, "right approach to AI" is being deployed at scale for some of the world's most complex banks, telcos, and consumer brands. I can say with confidence, you're in for a real treat. Without further ado, let's welcome today's guest. Helen, good to have you here, I'm looking forward to this discussion. Let's start with a little bit about your background and how you got into this field.

Dr. Helen Gu:

Thanks, Dan. Hi, everyone, and thanks for joining today's session. I feel humbled to have this opportunity to share my personal story. The journey actually started when I was a PhD student at the University of Illinois Urbana-Champaign. My adviser, Klara Nahrstedt, is a very famous researcher in the field of quality of service and multimedia.

Dr. Helen Gu:

Quality of service is the academic term for what industry calls a service level agreement or service level objective. QoS had actually been a research topic for a long time before I entered the academic world. Multimedia was a very interesting application to focus on back then because any short-term glitch causes big user dissatisfaction, so quality of service is a very important aspect of those applications. My PhD research was about building a distributed system, what I called back then a service overlay network. That was 2001, almost 20 years ago.

Dr. Helen Gu:

There’s no concept of cloud, there’s no concept of virtualization. Back then, there’s only a concept called web service. The vision I had for my PhD is that, I will develop a distributed system that users can have arbitrary content. They can actually stream into this service overlay network, and you can compose the web services as demand. So for example, the common application scenario I use is that, you have a bunch of photos and you can stream these photos into this service overlay network, and this overlay network will actually do some like image editing, automatic image editing, and then basically automatic categorization based on the content, and send a nice, organized, automatically generated, beautified the photo album to you. So that was my vision like 20 years ago. Today, it’s no longer imagination it’s reality. So the service is called Instagram and also Google Brain service, or Google Photo service. And also, the infrastructure I was imagining is basically cloud today.

Dr. Helen Gu:

Believe it or not, my first research paper actually used a neural network to predict the available bandwidth on a mobile device, and also to predict user preferences: based on the weather, your location, and your device, we used a neural network to predict what kind of multimedia content you would want to see on your device. Back then, of course, neural networks were not as hot as they are today. It was not a popular topic, simply because people didn't believe neural networks could produce accurate enough predictions.

Dr. Helen Gu:

Today, a lot of things have changed, right? I think our vision did not change, but the environment did. Today we have much more powerful machines and we have tons of data. The other thing that inspired me to do research in this area is that, back then, my passion was building reliable distributed systems. I remember the first keynote I heard at a distributed systems conference was delivered by Dr. Alfred Spector, who at the time led research at IBM T.J. Watson Research. He delivered a talk called The Conundrum of Distributed Systems.

Dr. Helen Gu:

The conundrum is basically this: back then, people thought they could just have one server running one web service, and that's it, they could support all their customers. However, things evolved. Now we all know one server is not going to be sufficient; you need distributed systems. And as we put more functionality, scalability requirements, and features into distributed systems, they become more and more complex. That's just an unavoidable trend.

Dr. Helen Gu:

As this complexity adds up, today we have, I would say, orders of magnitude higher complexity than 20 years ago. We have virtualization, we have microservices, we have all kinds of new features like AI and streaming, and all of those add complexity to the stack. So today you see a lot of people talking about full-stack analysis, just because the stack has become so complex. It's no longer just the operating system and the application; there are all kinds of middle layers in between.

Dr. Helen Gu:

All of this basically inspired me to look into how to tackle this complexity. I just happened to get interested in machine learning, but later on, as we studied it more, we saw that machine data is actually particularly amenable to machine learning algorithms, because we're dealing with tons of data. We don't have a data problem; we have lots of data. That's not our problem. The problem is: how do you extract insights from tons of data? So the research question I have been asking myself all this time is, how do we actually find the right insight in this data? People call it a data lake; I would say a data ocean. That has been my research for the past 20 years.

Dan Turchin:

At what point did you realize that the work you were doing in academia might have broader commercial applications?

Dr. Helen Gu:

That’s also a very interesting story, and I never thought about starting a company. So I’ve been in research lab, I worked at IBM Research before I joined the NC State. So I have been always in academia, and has always been as a researcher, a scientist. So we have been [inaudible 00:12:52] and published papers in this area. So I remember our first machine learning driven anomaly prediction paper was published back in 2006. So it was pretty early publication of using machine learning to do anomaly detection. So back then, I think that people, even in academia world, people feel like, “Okay, this kind of technology has value,” but they don’t feel like there’s a bigger potential behind it, because they feel that, it’s hard to achieve high accuracy.

Dr. Helen Gu:

Second, people felt that even if you have the prediction, say 20 or 30 minutes or an hour of lead time, a lot of problems are still handled manually, so that kind of lead time is not very beneficial for them. Third, there were maybe not a lot of outages. 20 years ago, you might deal with just one server, and one person could handle that; you could have a person babysit a machine. So the work was not catching a lot of attention. But we kept on, because we felt the data would keep growing and the systems would become more and more complex.

Dr. Helen Gu:

I think around that time cloud computing also started to take off, so we believed this was the inevitable challenge we had to tackle, and we kept working on it. And like many researchers in this field, and even today in the commercial world, people always start with supervised machine learning, and we also started with supervised machine learning. We used [inaudible 00:15:01], Gaussian mixture models, Bayesian networks, all those well-known techniques, and support vector machines. We achieved some good results using supervised machine learning, but then we kept collaborating with industry. For example, my research has been sponsored by Google, by IBM, and also by Credit Suisse.

Dr. Helen Gu:

I worked with them on some of the practical problems they encounter, and we quickly realized, "Okay, supervised machine learning is really hard to apply in the real world," because machine data is highly fluctuating and it's really hard to correctly label, not to mention very labor intensive to get those labels. You literally need a person to tell the machine, "This is good data, this is bad data." It's simply impractical. So we switched gears and started to focus on unsupervised machine learning around 2010. Then we had our first set of results.

Dr. Helen Gu:

From 2010 until about 2012, within roughly a two-year period, we tried a lot of unsupervised machine learning algorithms, like clustering-based approaches and nearest neighbor, and we kept getting very bad results. It's a lot more challenging to use unsupervised machine learning for anomaly detection in distributed systems. The reason is that distributed systems are highly dynamic and very fluctuating, and there are a lot of dependencies in them. Many machine learning models cannot capture those dependencies and cannot extract good patterns from fluctuating data.

Dr. Helen Gu:

We tried a lot of things, and then we bumped into something new. My PhD student Daniel Dean, the main student working on this area with me, came across a research technique from video tracking systems: the self-organizing map. It's a special kind of neural network technology, a little different from a traditional multi-layer neural network. It's typically used for tracking the paths of self-driving cars: you track the path and predict where the car will go. That was the original usage of the technique, and we were thinking, "Okay, can we use that for system anomaly detection?"

Dr. Helen Gu:

We tried this technique for about five months, and we failed. It was a very frustrating experience. The reason is that we thought we could track the evolving paths of distributed systems just like we track a car, and that is not the case, because a system is highly unpredictable: there's a lot of variation, and a lot of context that is hard to quantify with this simple neural network technology.

Dr. Helen Gu:

So then we switched gears and thought, "Okay, can we use it in a different way?" Luckily, we came up with a way to adapt this kind of technology for anomaly detection, and we published a paper back in 2012. News reporters then picked up the paper; they thought it was a very interesting technology touching the cloud, reliability, and AI. The article was read by many industry leaders, and I got calls from a lot of companies asking, "Can we use it in the real world?"
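To make the idea concrete, here is a minimal sketch of how a self-organizing map can be used for anomaly scoring: train a small map on vectors of metrics collected during normal operation, then score new samples by their distance to the closest map unit. This is an illustrative toy, not the adapted method from the 2012 paper; the grid size, training schedule, and toy metric values are assumptions made for the example.

```python
import numpy as np

# Minimal self-organizing map (SOM) sketch for anomaly scoring on metric vectors.
# Purely illustrative; grid size, learning schedule, and toy data are assumptions.

rng = np.random.default_rng(0)

def train_som(data, grid=(8, 8), epochs=20, lr0=0.5, sigma0=2.0):
    """Fit a small SOM grid to vectors of 'normal' metrics (one row per sample)."""
    n, d = data.shape
    weights = rng.random((grid[0], grid[1], d))
    gy, gx = np.meshgrid(np.arange(grid[0]), np.arange(grid[1]), indexing="ij")
    steps, t = epochs * n, 0
    for _ in range(epochs):
        for x in rng.permutation(data):
            lr = lr0 * (1 - t / steps)                # decaying learning rate
            sigma = sigma0 * (1 - t / steps) + 1e-3   # decaying neighborhood radius
            # Best matching unit (BMU): the grid cell whose weight is closest to x.
            dists = np.linalg.norm(weights - x, axis=2)
            by, bx = np.unravel_index(np.argmin(dists), dists.shape)
            # Pull the BMU and its grid neighbors toward the sample.
            h = np.exp(-((gy - by) ** 2 + (gx - bx) ** 2) / (2 * sigma ** 2))[:, :, None]
            weights += lr * h * (x - weights)
            t += 1
    return weights

def anomaly_score(weights, x):
    """Distance from a sample to its best matching unit; larger means more anomalous."""
    return float(np.min(np.linalg.norm(weights - x, axis=2)))

# Toy usage: train on 'normal' CPU/memory/latency-like vectors, then score samples.
normal = rng.normal(loc=[0.3, 0.4, 0.2], scale=0.05, size=(500, 3))
som = train_som(normal)
print(anomaly_score(som, np.array([0.31, 0.41, 0.22])))  # small score: looks normal
print(anomaly_score(som, np.array([0.95, 0.90, 0.85])))  # large score: anomalous
```

The useful property for this use case is that the trained map summarizes the shape of normal behavior, so samples far from every unit stand out without any labels.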

Dr. Helen Gu:

At that time, I think that's when the cloud took off. If you remember, back in 2011 AWS had a lot of outages. They actually had two famous ones: one was a very broad service outage that caused Netflix to go down on Christmas Eve, which made a lot of news headlines, and the other big outage was that they lost the whole US East region's instances. That's when a lot of cloud service providers realized this was a big issue for them. One of the companies that contacted us was Google. My research had been sponsored by Google, so I knew people there, and I started to collaborate with them.

Dr. Helen Gu:

After that, I basically spent a whole year at Google evaluating the ideas on Google's infrastructure, because at that time we only knew the idea worked in a lab environment; we didn't know whether it would work in a real production environment. We had never tried it with real-world production data before going to Google. After spending almost a year there, we tried it on 20 or 30 cloud outages in their production cloud environments, and we saw really encouraging results. We saw better accuracy than we expected, and more importantly, we saw that we could achieve really good lead time. In the lab environment we typically simulate a bug in a very small test bed, because we cannot afford a large one, but at Google we tested with all of their infrastructure data. They have tons of machines, tons of data.

Dr. Helen Gu:

What we found out is that a lot of production systems have built-in redundancy; they are very complex, highly redundant systems. You don't see one server go down and the whole service go down. You never see that. What you see is that the problem starts with a few machines, gradually propagates through the big infrastructure, and eventually brings down the whole service. That is where we could achieve very good lead time. For one outage we achieved more than 24 hours of lead time, which would have given them sufficient time to react and avoid a big service outage that brought down their production service for seven days.

Dr. Helen Gu:

This was very encouraging, and Google actually licensed the technology. I also got encouragement from my program manager at the National Science Foundation. As faculty, the majority of our support comes from the National Science Foundation, and luckily the NSF has a very good program called the SBIR grant, which basically allows any researcher to commercialize their idea, their research prototype. So I got encouragement from the program manager, and I got a grant from them.

Dr. Helen Gu:

We got our Phase I grant, which is $150,000, not much, and I realized, "Okay, maybe I can try to commercialize this." So I started the company back in 2016, after an exciting journey at Google. That's how InsightFinder got started.

Dan Turchin:

So, I teased a little bit about the recent paper that received the Symposium on Cloud Computing 10-Year Award. Tell us a little bit about the research behind that.

Dr. Helen Gu:

That’s actually another collaboration we did with Google together. So the research started, I think it’s around 2009. So that’s basically a year after I just got my academia career started. Back then, Google has a really nice program called the Google Faculty Award and also Google Faculty Summit. So I met my collaborator, John Wilkes, in that summit. I was telling him my idea at that time is that, cloud just get started. Back then, I don’t think that people have concerns about their cloud bill because they think, “Okay, you run the one machine, cloud is so cheap, right? 10 cents per hour, okay, that’s cheap.”

Dr. Helen Gu:

So they didn’t realize like… We did some math so we think, “Wait a second. If you actually run 1,000 machines and then for a year, you actually pay hundreds of thousands of dollars on that.” The other thing, for example today, when we talk to some of our customers, they pay like tens of millions of dollars to cloud service providers each year. So the cost is something a lot of people didn’t realize back then. But then, we did a projection on this and we realized, “Okay, there will be some concern on that.”

Dr. Helen Gu:

The other thing is that I have always focused on optimization, right? I worked with some users, and it bothered me that their utilization was only around 7% or 10%. They are paying a cloud service provider for 100% of the capacity and only using 7%. Also, in the traditional data center environment before the cloud started, in private data centers, we always heard people say, "Oh, I don't have enough resources, I need more resources." Back then there was a very common technique I believe everybody has experienced, called overbooking. Airlines do it all the time. Sometimes you go to the airline and they say, "Oh, we ran out of seats." Why? They're overbooking. That's one way for the airline to make money, right? They expect some people not to use their seats.

Dr. Helen Gu:

Private data centers use this overbooking technique as well, and it's actually quite effective at cutting down resource cost, but back then there was no such technique in the cloud. The other thing is that, since we started with prediction, we thought, "Okay, when you do this kind of resource optimization, you can use prediction to guide your resource scaling, right?" In particular, virtual machine migration was just becoming popular back then.

Dr. Helen Gu:

One of the things we observed, for example, is that when you try to migrate a virtual machine, if you trigger the migration before the host CPU reaches a very high level, say before 90%, you can finish the migration in 10 seconds. However, if you trigger the migration after the host is already overloaded, say at 95% CPU utilization, it takes several hours to finish. So that's the power of prediction, right?

Dr. Helen Gu:

If I can give you maybe just 10 or 20 seconds of lead time and let you trigger the migration before the CPU hits 90%, you can get this done seamlessly. But if you don't have prediction and you wait until the CPU reaches 95%, you will never get it done, because it takes hours to finish the migration. So that was our motivation back then. We realized the power of prediction, and we realized we had the technology to implement it, so we developed it.
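To show the shape of that logic, here is a small, hypothetical sketch of prediction-guided migration: extrapolate host CPU a short horizon ahead and trigger migration before the high-water mark is crossed. The forecast method, thresholds, and function names are assumptions chosen for illustration, not the algorithm from the paper.

```python
import numpy as np

# Illustrative sketch of prediction-guided migration (not InsightFinder's algorithm).
# Idea: extrapolate host CPU a short horizon ahead and trigger migration *before*
# utilization crosses the high-water mark, while migration is still cheap.

def predict_cpu(history, horizon_s, step_s=10):
    """Naive linear-trend forecast of CPU utilization `horizon_s` seconds ahead."""
    t = np.arange(len(history)) * step_s
    slope, intercept = np.polyfit(t, history, 1)
    return slope * (t[-1] + horizon_s) + intercept

def should_migrate(history, threshold=0.90, horizon_s=60):
    """Return True if the host is predicted to exceed the threshold soon."""
    return predict_cpu(history, horizon_s) >= threshold

# Toy usage: CPU samples every 10 seconds, climbing toward saturation.
cpu = [0.55, 0.60, 0.66, 0.71, 0.78, 0.83]
if should_migrate(cpu):
    print("trigger migration now, before the host is overloaded")
```

A real system would use a far more robust forecaster, but the decision structure is the same: act on where the metric is heading rather than where it is.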

Dr. Helen Gu:

Back then, of course, no commercial cloud service provider had this kind of technique. There was an auto-scaling service, but that scaling was not really automatic, it was manual, right? The user had to set the thresholds that triggered it, and a lot of times they didn't know what kind of threshold to set. More importantly, it was still reactive. So that's where we started.

Dr. Helen Gu:

And this year marks 10 years for that research paper. The committee basically looks at the citations to the paper and its impact on the real world. If you look at the timeline of auto-scaling services in the commercial world, in 2019, basically just last year, AWS released its first predictive auto-scaling service, so it took nine years for the idea to transition into the real world. We were very honored to get this award, and it's definitely super rewarding for me personally to see an idea generate real-world impact.

Dan Turchin:

Talking about that journey of commercializing technology that came out of a research environment: fast-forward, you've now commercialized some of the core technologies and introduced them to hundreds of customers around the world. The question is, what are some of the common mistakes you've seen customers make when they're trying to adopt anomaly detection or incident prediction?

Dr. Helen Gu:

Definitely. This has been a really hot area, and a lot of industry leaders are looking into it. Many times, big enterprise companies hire data scientists to work together with their engineers to implement some kind of intelligent system management technique. The common problem I saw is basically the understanding of assumptions. In the research world, assumptions are always very important, because if you don't understand the assumptions, you will apply a technique incorrectly. Many machine learning algorithms have assumptions about the data, and that's where I saw the pitfalls: when engineers use those algorithms, they don't understand the assumptions.

Dr. Helen Gu:

For example, take a very simple technique, PCA, principal component analysis. This is a very common technique used for dimensionality reduction, because with infrastructure data, machine data, you often deal with high-dimensional data. So a lot of data scientists will recommend, "Oh, you can use PCA to reduce the dimensions, and then you can improve the accuracy of anomaly detection." That's one of the common techniques we saw, and also one of the techniques we tried at Google.

Dr. Helen Gu:

The problem with PCA is that it has assumptions, and the assumptions of PCA are actually pretty simple: the data in different dimensions has to be linearly correlated, which means you can derive one dimension from another using a linear equation. Only when that linear correlation exists will PCA be effective; otherwise it is not. Obviously, if you know these assumptions, you look at the machine data, right? A data scientist doesn't know whether those assumptions hold in machine data, but engineers do. They know the system. If they look at their systems, they'll say, "Wait a second, I don't see this linear correlation among my system metrics most of the time." We all know system resources and performance are non-linearly correlated; that's a well-known phenomenon. So that's a fundamental problem, right?
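One way to act on that caveat is a quick pre-flight check before trusting PCA on a set of metrics: standardize the data and see how many components are needed to explain most of the variance. If nearly all dimensions are needed, the linear structure PCA relies on is weak. The sketch below is illustrative only; the variance target and the helper name are assumptions, not anything from the interview.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_preflight(metrics, var_target=0.9):
    """metrics: 2-D array, rows = samples, columns = metrics (e.g. cpu, mem, latency)."""
    # Standardize so no single metric dominates purely because of its scale.
    z = (metrics - metrics.mean(axis=0)) / (metrics.std(axis=0) + 1e-9)
    pca = PCA().fit(z)
    cum = np.cumsum(pca.explained_variance_ratio_)
    k = int(np.searchsorted(cum, var_target) + 1)
    print(f"{k} of {metrics.shape[1]} components explain {cum[k - 1]:.0%} of variance")
    # If k is close to the original dimensionality, the linear correlation that
    # PCA assumes is weak, and the reduction is unlikely to help anomaly detection.
    return k

# Toy usage with uncorrelated random data: expect k close to the full dimension.
rng = np.random.default_rng(0)
pca_preflight(rng.normal(size=(1000, 8)))
```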

Dr. Helen Gu:

So the data scientists know the assumptions of the algorithms, and the engineers know the assumptions of the data, and sometimes they don't talk to each other unless someone says, "Okay, this is critical for me, and you need to validate it before I use it." A lot of the time, that's exactly why they're called assumptions: people take them for granted, right?

Dr. Helen Gu:

The other very common mistake I saw is with clustering. Clustering is probably the easiest-to-use unsupervised machine learning technique for anomaly detection. One of its biggest assumptions is that you need to be able to distinguish two data samples based on Euclidean distance; that's the fundamental theory behind it, right? If you have two data samples, which could be vectors, and you can say, "This sample is different from that sample just based on the Euclidean distance," then you're good. But if you think about the real world, like machine data with high dimensionality, people don't realize that this Euclidean distance calculation rests on the assumption that you don't have a lot of dimensions.

Dr. Helen Gu:

For a clustering algorithm, pretty much once you have more than about 20 dimensions, it will no longer be accurate, because Euclidean distance cannot capture the differences in such a high-dimensional data space. So again, assumptions, right? The other thing is that data scientists often don't understand some intrinsic dependencies between different machines and between different metrics. For example, a common problem: we just talked to one customer yesterday, and their data scientist raised a question to me. He said, "One of the common things we find is that when an incident happens and a lot of system metrics exhibit anomalies, how do I know which ones are the principal ones? How do I know which is the root cause? Because if you present all those anomalies to users and operators, they will still find it useless, because they don't know where to start."
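The roughly-20-dimensions rule of thumb comes from the distance concentration effect: as dimensionality grows, the gap between the nearest and farthest neighbor shrinks, so Euclidean distance stops telling samples apart. A short, purely synthetic sketch makes the effect visible:

```python
import numpy as np

# Sketch of the distance concentration effect behind the clustering caveat above:
# as dimensionality grows, the relative gap between the nearest and farthest
# neighbor shrinks, so Euclidean distance stops separating samples well.

rng = np.random.default_rng(0)

for dims in [2, 20, 200]:
    points = rng.random((1000, dims))
    query = rng.random(dims)
    d = np.linalg.norm(points - query, axis=1)
    contrast = (d.max() - d.min()) / d.min()   # relative spread of distances
    print(f"{dims:4d} dims: relative contrast {contrast:.2f}")
# The relative contrast drops sharply as the dimensionality increases, which is
# why distance-based clustering loses accuracy on high-dimensional machine metrics.
```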

Dr. Helen Gu:

If you understand the systems, then you understand the intrinsic meaning of these metrics. For example, take a CPU metric and a load average metric: typically, when you see anomalies in the CPU metric, you will also see anomalies in the load average. From a data scientist's point of view, these are just two independent metrics. But from a system researcher's perspective, these two metrics are correlated, because the load average, if you understand the meaning behind it, directly reflects the CPU usage.

Dr. Helen Gu:

So there is intrinsic dependency information in the system, and also in distributed systems, right? We know which machine depends on which machine; you can easily derive that from domain knowledge of the application. That information is typically not taken into account by data scientists. So if you just apply those techniques directly to machine data problems, you will run into all kinds of problems, like low accuracy.
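As one illustration of folding that dependency knowledge into anomaly reporting, here is a toy sketch that treats two metrics as related if they are either declared dependent (domain knowledge) or measured to be strongly correlated, so that a CPU anomaly and a load-average anomaly can be grouped into one finding rather than two. The metric names, threshold, and helper function are hypothetical.

```python
import numpy as np

# Toy sketch: combine hand-curated domain knowledge with a correlation test to
# decide whether two metrics should be reported as one finding. Names and
# thresholds are illustrative assumptions only.

KNOWN_DEPENDENCIES = {("cpu_util", "load_avg")}   # domain knowledge, hand-curated

def dependent(name_a, series_a, name_b, series_b, corr_threshold=0.8):
    if (name_a, name_b) in KNOWN_DEPENDENCIES or (name_b, name_a) in KNOWN_DEPENDENCIES:
        return True
    corr = np.corrcoef(series_a, series_b)[0, 1]
    return abs(corr) >= corr_threshold

# Toy data: load average tracks CPU with some noise; disk I/O is unrelated.
rng = np.random.default_rng(0)
cpu = np.clip(rng.normal(0.5, 0.15, 500), 0, 1)
load = 4 * cpu + rng.normal(0, 0.2, 500)
disk = rng.normal(20, 5, 500)

print(dependent("cpu_util", cpu, "load_avg", load))   # True: group these anomalies
print(dependent("cpu_util", cpu, "disk_io", disk))    # False: report separately
```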

Dr. Helen Gu:

The other thing I saw [inaudible 00:37:36] is that a lot of people have heard about deep learning. Deep learning is very powerful if you have sufficient resources. But one of the fundamental assumptions of deep learning is that you need labeled training data, and you need lots of it. If you don't have sufficient training data, you won't be able to derive a good model. Once you do have sufficient training data, yes, deep learning can be very powerful.

Dr. Helen Gu:

Then the question is what happens in real-world, dynamic production environments. In particular, our recent research is about how to do anomaly detection for containers, because containers are short-lived, right? You start a container, run maybe a few minutes of operations, and then remove it, because it's so easy to start a container again. Because of that short-lived, transient behavior, it's really hard for any complex model like deep learning to capture those patterns. Those are the unique challenges that distributed systems present to the machine learning world.

Dr. Helen Gu:

Another good example is a common prediction technique used by many companies, LSTM. This is also a neural network technology, but for that kind of model you have to have a relatively stable trend. A well-known example is Facebook Prophet. Facebook open-sourced their prediction algorithm, called Prophet, and I encourage everybody to try it; behind it is a time-series forecasting model. If you read their documentation carefully, they say it works particularly well on metrics with seasonal patterns. That's because if you have very highly fluctuating, highly dynamic time series data, these models won't work. Every machine learning model has its assumptions, and when you apply one, you have to look at your data and ask yourself, "Does my data satisfy those assumptions?" Otherwise the machine learning algorithm won't work.
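As a hands-on illustration of the seasonal-metrics point, here is a small sketch that fits the open-source prophet package to a synthetic hourly metric with a daily cycle and flags observations that fall outside the forecast's uncertainty band. The synthetic data, the band-based flagging rule, and the parameters are all assumptions made for this example.

```python
import numpy as np
import pandas as pd
from prophet import Prophet  # pip install prophet; assumed available

# Sketch of forecast-based anomaly detection on a *seasonal* metric, the kind of
# data these models are suited to; erratic, non-seasonal series won't fit well.

# Hypothetical hourly request-rate metric with a daily cycle plus noise.
ds = pd.date_range("2021-01-01", periods=24 * 30, freq="H")
y = 100 + 30 * np.sin(2 * np.pi * ds.hour / 24) + np.random.default_rng(0).normal(0, 5, len(ds))
df = pd.DataFrame({"ds": ds, "y": y})

model = Prophet(daily_seasonality=True)
model.fit(df)

# Forecast ahead and flag observed points outside the uncertainty band.
future = model.make_future_dataframe(periods=24, freq="H")
forecast = model.predict(future)
merged = df.merge(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]], on="ds")
anomalies = merged[(merged.y < merged.yhat_lower) | (merged.y > merged.yhat_upper)]
print(f"{len(anomalies)} points fall outside the forecast band")
```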

Dr. Helen Gu:

The other thing I want to mention is that a lot of people just give up when they try once and fail, and then they believe machine learning doesn't work. Essentially, no single model fits all. Because we deal with such a complex data space in the machine data world, you cannot expect one technique to work for all of the data. You have to use ensemble techniques to figure out what kind of data you are dealing with and apply the right technique for that data. This also has to be automated; you cannot rely on humans to do it. That's one of the common problems we saw customers deal with: they have to rebuild the model every time they see new data, not to mention calibrating the model.
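To give a flavor of that kind of automated, per-metric model selection, here is a toy router that checks each series for a daily seasonal pattern and sends it to a different style of detector. The lag, threshold, and the two detectors are illustrative assumptions, not InsightFinder's ensemble.

```python
import numpy as np

# Toy sketch of "no single model fits all": route each metric to a detector
# based on a cheap seasonality check. Lag, threshold, and detectors are
# illustrative assumptions only.

def seasonal_strength(series, lag):
    """Rough autocorrelation at the seasonal lag; near 1 means strongly periodic."""
    s = np.asarray(series, dtype=float)
    s = s - s.mean()
    return float(np.dot(s[:-lag], s[lag:]) / (np.dot(s, s) + 1e-12))

def robust_zscore(series):
    """Median/MAD score, a reasonable fallback for noisy, non-seasonal metrics."""
    s = np.asarray(series, dtype=float)
    med = np.median(s)
    mad = np.median(np.abs(s - med)) + 1e-12
    return 0.6745 * (s - med) / mad

def choose_detector(series, seasonal_lag=24):
    """Pick a detector family for an hourly metric based on its daily seasonality."""
    if seasonal_strength(series, seasonal_lag) > 0.5:
        return "seasonal forecast model (like the Prophet-style sketch above)"
    return "robust z-score detector"

# Two synthetic hourly metrics: one with a daily cycle, one erratic.
rng = np.random.default_rng(0)
hours = np.arange(24 * 14)
seasonal = 50 + 20 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 2, len(hours))
erratic = rng.exponential(5, len(hours))

print(choose_detector(seasonal))             # seasonal forecast model
print(choose_detector(erratic))              # robust z-score detector
print(np.abs(robust_zscore(erratic)).max())  # largest outlier score in the erratic series
```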

Dan Turchin:

So for everyone listening, that's Facebook's Prophet with a PH. Although we're all interested in Facebook's earnings, it's the PH and not the F. I just wanted to make sure that got into the recording. Helen, this has been great. We're short on time, but I can't let you go without asking one more important question. Roll back the clock 10 years: you had this insight about the need for elastic resource scaling, and obviously 10 years later it's as relevant as ever. Now, polish your crystal ball and tell us, what are the big ideas that you think will lead to technological breakthroughs, call it over the next five to 10 years?

Dr. Helen Gu:

Definitely. I feel like machine learning is an ideal technology for machine data analysis; I couldn't imagine a better area for AI technology to take off. We see automated driving, image recognition, face recognition, and gaming, and those are all hot AI applications. However, I believe IT operations, applications, and DevOps are the perfect field for machine learning, because we have so much data, there are tons of insights, and our operators need help. My husband is a software engineer, and sometimes he has to get up in the middle of the night to fix production bugs. I want to help him first, and I want to let people get their sleep back, get their time back, to work on the problems that matter to them and that they are most interested in.

Dr. Helen Gu:

There’s still a lot of open problems, and particularly, my personal research interest recently is really zooming on this causal analysis. I think in the machine learning world, anomaly detection is hard, but I think the causal analysis is really the fundamental, the hardest part of the whole space. We had some recent research papers published in this area that can actually do automatic bug fixing by identifying specific root causes and have come up with the fix automatically. So for example, one of the most recent paper we published called The HangFix, we can actually fix those hang bugs in seconds. So this is actually I think a very exciting area. So there’s a lot of I would say low hanging fruits for people to tackle because, we just have too many problems and those problems are not going away.

Dr. Helen Gu:

Fundamentally, humans write code and humans make mistakes, and those mistakes won't go away quickly. And as we rely on IT infrastructure more and more, outages will very soon become life-threatening, so every minute, every second we can save users from a service outage will be extremely valuable.

Dan Turchin:

Helen, as that future becomes a reality, can I ask you to come back and talk more about the state-of-the-art?

Dr. Helen Gu:

Sure, certainly. I'd be happy to.

Dan Turchin:

Good. Well, we’re out of time for today. This has been fantastic. Hopefully everyone in the audience has learned a lot. We will have Helen back. We were just scratching the surface, but a fascinating discussion. For everyone that provided questions in the chat panel that we didn’t get to, I promise we will get to them offline. Thank you for your time and attention. You’ll see our contact information, both Helen’s and mine. I will leave you with a quote that I think is as relevant today as it was two years ago, and it will probably even be more relevant two years from now, “What can be predicted as better left to machines, but what requires intuition or empathy is better left to humans.”

Dan Turchin:

So with that, if you're interested in learning more about InsightFinder, please go to Insightfinder.com/request-trial. We'd love to have you join our fast-growing user community and expose you to some of the technologies that Dr. Gu talked about in today's discussion. With that said, thank you once again.

