This post is co-written by Fan Zhang, Sr. Principal Engineer/Architect at Palo Alto Networks.
Palo Alto Networks’ Device Security team wanted to detect early warning signs of potential production issues to give SMEs more time to react to these emerging problems. The primary challenge they faced was that reactively processing over 200 million daily service and application log entries resulted in delayed response times to these critical issues, leaving them at risk of potential service degradation.
To address this challenge, they partnered with the AWS Generative AI Innovation Center (GenAIIC) to develop an automated log classification pipeline powered by Amazon Bedrock. The solution achieved 95% precision in detecting production issues while reducing incident response times by 83%.
In this post, we explore how to build a scalable and cost-effective log analysis system using Amazon Bedrock to transform reactive log monitoring into proactive issue detection. We discuss how Amazon Bedrock, through Anthropic’s Claude Haiku model, and Amazon Titan Text Embeddings work together to automatically classify and analyze log data. We explore how this automated pipeline detects critical issues, examine the solution architecture, and share implementation insights that have delivered measurable operational improvements.
Palo Alto Networks offers Cloud-Delivered Security Services (CDSS) to tackle device security risks. Their solution uses machine learning and automated discovery to provide visibility into connected devices, enforcing Zero Trust principles. Teams facing similar log analysis challenges can find practical insights in this implementation.
Solution overview
Palo Alto Networks’ automated log classification system helps their Device Security team detect and respond to potential service failures ahead of time. The solution processes over 200 million service and application logs daily, automatically identifying critical issues before they escalate into service outages that affect customers.
The system uses Amazon Bedrock with Anthropic’s Claude Haiku model to understand log patterns and classify severity levels, and Amazon Titan Text Embeddings enables intelligent similarity matching. Amazon Aurora provides a caching layer that makes processing massive log volumes feasible in real time. The solution integrates seamlessly with Palo Alto Networks’ existing infrastructure, helping the Device Security team focus on preventing outages instead of managing complex log analysis processes.
Palo Alto Networks and the AWS GenAIIC collaborated to build a solution with the following capabilities:
- Intelligent deduplication and caching – The system scales by intelligently identifying duplicate log entries for the same code event. Rather than using a large language model (LLM) to classify every log individually, the system first identifies duplicates through exact matching, then uses overlap similarity, and finally employs semantic similarity only if no previous match is found. This approach cost-effectively reduces the 200 million daily logs by over 99%, to logs representing only unique events. The caching layer enables real-time processing by reducing the need for redundant LLM invocations.
- Context retrieval for unique logs – For unique logs, Anthropic’s Claude Haiku model on Amazon Bedrock classifies each log’s severity. The model processes the incoming log together with relevant labeled historical examples. The examples are dynamically retrieved at inference time through vector similarity search. Over time, labeled examples are added to provide rich context to the LLM for classification. This context-aware approach improves accuracy for Palo Alto Networks’ internal logs and systems and for evolving log patterns that traditional rule-based systems struggle to handle.
- Classification with Amazon Bedrock – The solution provides structured predictions, including severity classification (Priority 1 (P1), Priority 2 (P2), Priority 3 (P3)) and detailed reasoning for each decision. This comprehensive output helps Palo Alto Networks’ SMEs quickly prioritize responses and take preventive action before potential outages occur.
- Integration with existing pipelines for action – Results integrate with their existing FluentD and Kafka pipeline, with data flowing to Amazon Simple Storage Service (Amazon S3) and Amazon Redshift for further analysis and reporting.
The following diagram (Figure 1) illustrates how the three-stage pipeline processes Palo Alto Networks’ 200 million daily log volume while balancing scale, accuracy, and cost-efficiency. The architecture consists of the following key components:
- Data ingestion layer – FluentD and Kafka pipeline and incoming logs
- Processing pipeline – Consisting of the following stages:
- Stage 1: Smart caching and deduplication – Aurora for exact matching and Amazon Titan Text Embeddings for semantic matching
- Stage 2: Context retrieval – Amazon Titan Text Embeddings for vector similarity search over historical labeled examples
- Stage 3: Classification – Anthropic’s Claude Haiku model for severity classification (P1/P2/P3)
- Output layer – Aurora, Amazon S3, Amazon Redshift, and SME review interface
The processing workflow moves through the following stages:
- Stage 1: Smart caching and deduplication – Incoming logs from Palo Alto Networks’ FluentD and Kafka pipeline are immediately processed through an Aurora-based caching layer. The system first applies exact matching, then falls back to overlap similarity, and finally uses semantic similarity through Amazon Titan Text Embeddings if no previous match is found. During testing, this approach identified that more than 99% of logs corresponded to duplicate events, even though they contained different timestamps, log levels, and phrasing. The caching system reduced response times for cached results and cut unnecessary LLM processing.
- Stage 2: Context retrieval for unique logs – The remaining less than 1% of truly unique logs require classification. For these entries, the system uses Amazon Titan Text Embeddings to identify the most relevant historical examples from Palo Alto Networks’ labeled dataset. Rather than relying on static examples, this dynamic retrieval makes sure each log receives contextually appropriate guidance for classification.
- Stage 3: Classification with Amazon Bedrock – Unique logs and their selected examples are processed by Amazon Bedrock using Anthropic’s Claude Haiku model. The model analyzes the log content alongside relevant historical examples to produce severity classifications (P1, P2, P3) and detailed explanations. Results are stored in Aurora and the cache and integrated into Palo Alto Networks’ existing data pipeline for SME review and action.
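Stage 2’s dynamic example selection can be sketched as a top-k cosine-similarity search over the labeled dataset. The function names and data layout below are illustrative assumptions, not the actual implementation; the `embed` helper calls Amazon Titan Text Embeddings through the Bedrock runtime (the model ID shown is one published Titan embedding model, and the call requires boto3 and AWS credentials):

```python
import math

def embed(text, region="us-east-1"):
    """Embed a log line with Amazon Titan Text Embeddings on Amazon Bedrock.

    Model ID and region are assumptions; requires boto3 and AWS credentials.
    """
    import json
    import boto3  # imported here so the pure-Python ranking below runs without the SDK

    client = boto3.client("bedrock-runtime", region_name=region)
    resp = client.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(resp["body"].read())["embedding"]

def top_k_examples(query_vec, labeled_examples, k=5):
    # Rank labeled historical examples by cosine similarity to the incoming log
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    ranked = sorted(labeled_examples, key=lambda ex: cos(query_vec, ex["vec"]), reverse=True)
    return ranked[:k]
```

The k most similar labeled examples are then spliced into the classification prompt as few-shot context for Stage 3.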
This architecture enables cost-effective processing of massive log volumes while maintaining 95% precision for critical P1 severity detection. The system uses carefully crafted prompts that combine domain expertise with dynamically selected examples:
system_prompt = """
You are an expert log analysis system responsible for classifying production system logs based on severity. Your analysis helps engineering teams prioritize their response to system issues and maintain service reliability.

Severity levels:
P1 (Critical): Requires immediate action - system-wide outages, repeated application crashes
P2 (High): Warrants attention during business hours - performance issues, partial service disruption
P3 (Low): Can be addressed when resources are available - minor bugs, authorization failures, intermittent network issues

Examples:
2024-08-17 01:15:00.00 [warn] failed (104: Connection reset by peer) while reading response header from upstream
severity: P3
category: Category A

2024-08-18 17:40:00.00 Error: Request failed with status code 500 at settle
severity: P2
category: Category B

Log: {incoming_log_snippet}
Location: {system_location}

Provide severity classification (P1/P2/P3) and detailed reasoning.
"""
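A prompt like this can be sent to Anthropic’s Claude Haiku model through the Bedrock Converse API. The following is an illustrative sketch under stated assumptions, not Palo Alto Networks’ production code: the model ID and the `build_prompt` helper are assumptions, and the call requires boto3 and AWS credentials:

```python
def build_prompt(system_prompt, log_snippet, location):
    # Fill the template placeholders with the incoming log and its source system
    return system_prompt.format(
        incoming_log_snippet=log_snippet, system_location=location
    )

def classify_log(prompt, region="us-east-1"):
    """Call Claude Haiku on Amazon Bedrock and return the model's text response.

    The model ID below is an assumption; requires boto3 and AWS credentials.
    """
    import boto3  # imported here so build_prompt stays usable without the SDK

    client = boto3.client("bedrock-runtime", region_name=region)
    resp = client.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0.0},
    )
    return resp["output"]["message"]["content"][0]["text"]
```

Setting the temperature to zero keeps severity labels deterministic across retries, which matters when cached results are reused for months.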
Implementation insights
The core value of Palo Alto Networks’ solution lies in making an insurmountable challenge manageable: AI helps their team analyze 200 million daily log entries efficiently, while the system’s dynamic adaptability makes it possible to extend the solution into the future by adding more labeled examples. Palo Alto Networks’ successful implementation of their automated log classification system yielded key insights that can help organizations building production-scale AI solutions:
- Continuous learning systems deliver compounding value – Palo Alto Networks designed their system to improve automatically as SMEs validate classifications and label new examples. Each validated classification becomes part of the dynamic few-shot retrieval dataset, improving accuracy for similar future logs while increasing cache hit rates. This approach creates a cycle where operational use enhances system performance and reduces costs.
- Intelligent caching enables AI at production scale – The multi-layered caching architecture processes more than 99% of logs through cache hits, transforming expensive per-log LLM operations into a cost-effective system capable of handling 200 million daily log entries. This foundation makes AI processing economically viable at enterprise scale while maintaining response times.
- Adaptive systems handle evolving requirements without code changes – The solution accommodates new log categories and patterns without requiring system modifications. When performance needs improvement for novel log types, SMEs can label additional examples, and the dynamic few-shot retrieval automatically incorporates this knowledge into future classifications. This adaptability allows the system to scale with business needs.
- Explainable classifications drive operational confidence – SMEs responding to critical alerts require confidence in AI recommendations, particularly for P1 severity classifications. By providing detailed reasoning alongside each classification, Palo Alto Networks enables SMEs to quickly validate decisions and take appropriate action. Clear explanations transform AI outputs from predictions into actionable intelligence.
These insights demonstrate how AI systems designed for continuous learning and explainability become increasingly valuable operational assets.
Conclusion
Palo Alto Networks’ automated log classification system demonstrates how generative AI powered by AWS helps operational teams manage massive volumes in real time. In this post, we explored how an architecture combining Amazon Bedrock, Amazon Titan Text Embeddings, and Aurora processes 200 million daily logs through intelligent caching and dynamic few-shot learning, enabling proactive detection of critical issues with 95% precision. Palo Alto Networks’ automated log classification system delivered concrete operational improvements:
- 95% precision, 90% recall for P1 severity logs – Critical alerts are accurate and actionable, minimizing false alarms while catching 9 out of 10 urgent issues, leaving the remaining alerts to be captured by existing monitoring systems
- 83% reduction in debugging time – SMEs spend less time on routine log analysis and more time on strategic improvements
- Over 99% cache hit rate – The intelligent caching layer processes the 200 million daily log volume cost-effectively through subsecond responses
- Proactive issue detection – The system identifies potential problems before they affect customers, preventing the multi-week outages that previously disrupted service
- Continuous improvement – Each SME validation automatically improves future classifications and increases cache efficiency, resulting in reduced costs
For organizations evaluating AI initiatives for log analysis and operational monitoring, Palo Alto Networks’ implementation offers a blueprint for building production-scale systems that deliver measurable improvements in operational efficiency and cost reduction. To build your own generative AI solutions, explore Amazon Bedrock for managed access to foundation models. For additional guidance, check out the AWS Machine Learning resources and browse implementation examples on the AWS Artificial Intelligence Blog.
The collaboration between Palo Alto Networks and the AWS GenAIIC demonstrates how thoughtful AI implementation can transform reactive operations into proactive, scalable systems that deliver sustained business value.
To get started with Amazon Bedrock, see Build generative AI solutions with Amazon Bedrock.
About the authors




