Small and Medium Business

Key Tasks

OPE-001

Define, Monitor, and Analyze Customer Workload Health KPIs

Define Customer Workload Health KPIs

Requirement Description

English

Define, monitor and analyze customer workload health KPIs

AWS Partner has defined metrics for determining the health of each component of the workload and provided the customer with guidance on how to detect operational events based on these metrics. Establish the capability to run, monitor and improve operational procedure by:
• Defining, collecting and analyzing workload health metrics w/AWS services or 3rd Party tool
• Exporting standard application logs that capture errors and aid in troubleshooting and response to operational events.
• Defining threshold of operational metrics to generate alert for any issues

Please provide the following as evidence:
• Standardized documents or guidance on how to develop customer workload health KPIs with the three components above
• Description of how workload health KPIs are implemented in (1) of the submitted customer examples.

Korean

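**Define, monitor, and analyze customer workload health KPIs**

The AWS Partner must define metrics for determining the health of each component of the workload and provide the customer with guidance on how to detect operational events based on these metrics. Build the capability to run, monitor, and improve operational procedures by:
• Defining, collecting, and analyzing workload health metrics using AWS services or third-party tools
• Exporting standard application logs that capture errors and support troubleshooting and response to operational events
• Defining thresholds on operational metrics that generate alerts when issues occur

**Please submit the following as evidence:**
• Standardized documents or guidelines on how to develop customer workload health KPIs, covering the three components above
• A description of how workload health KPIs were implemented in (1) of the submitted customer examples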

Submitted Response

English

This engagement focused on CloudFront-based static content delivery and image optimization. We implemented CloudFront access log analysis for traffic patterns and cache performance. We did not establish a comprehensive operational health KPI framework with real-time alerting, because the architecture consisted primarily of AWS managed services (S3, CloudFront, Lambda). Our standard monitoring framework is documented in our Workload Health Monitoring Guide and demonstrated in our Happy Edu Tech implementation.

---

This engagement focused on marketplace platform modernization architecture and cloud-native design. The customer's internal technical team handled operational implementation, including monitoring and observability configuration. Our scope covered architectural guidance and best-practice recommendations for CloudWatch monitoring setup. The customer's operations team managed the actual implementation, KPI definition, and operational dashboard development according to their internal standards and business metric requirements. This approach matched the customer's preference to retain operational control and integrate monitoring with their existing internal processes. Our monitoring capabilities and KPI frameworks are documented in our Workload Health Monitoring Guide and demonstrated in our [Magazine Platform] customer implementation.

---

Our monitoring framework uses Amazon CloudWatch as the primary observability platform, collecting and analyzing workload health across the infrastructure, application, and business layers. We establish standardized KPIs through automated metric collection, with CloudWatch Alarms configured for both threshold-based and anomaly detection alerting. Structured application logs are centralized in CloudWatch Logs, which supports rapid root cause analysis through CloudWatch Logs Insights queries and custom dashboards for real-time operational visibility.
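
As an illustration of the structured-logging part of the framework, the following is a minimal sketch of a JSON log formatter built only on the Python standard library, paired with a CloudWatch Logs Insights query that filters on the resulting fields. The logger name `workload.api`, the `context` convention, the `request_id` field, and the query text are assumptions made for this example and are not taken from the Workload Health Monitoring Guide.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical convention: callers pass extra={"context": {...}}.
        context = getattr(record, "context", None)
        if context:
            payload["context"] = context
        return json.dumps(payload, ensure_ascii=False)


def build_logger(name: str, stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


# Assumed Logs Insights query: the 20 most recent ERROR entries.
# Field names match the JSON keys emitted by JsonFormatter above.
ERROR_QUERY = (
    "fields @timestamp, message, context.request_id "
    '| filter level = "ERROR" '
    "| sort @timestamp desc "
    "| limit 20"
)

# Demonstration against an in-memory stream instead of a real log group.
buffer = io.StringIO()
log = build_logger("workload.api", buffer)
log.error("upload failed", extra={"context": {"request_id": "req-123"}})
entry = json.loads(buffer.getvalue())
```

Emitting one JSON object per line lets CloudWatch Logs discover the keys as queryable fields, so a filter such as `level = "ERROR"` works without a custom parse step.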
Implementation Example - Happy Edu Tech AI-Powered Math Learning Platform:

We deployed monitoring that covers infrastructure metrics (Lambda concurrent executions, SQS queue depth, GPU instance utilization in Auto Scaling Groups), application performance (OCR processing time, AI inference latency, ElastiCache cache hit rates, API Gateway request counts), and AI-specific business KPIs (OCR accuracy rates, model inference response times, problem recognition success rates).

We configured CloudWatch Alarms with tiered severity levels. Critical alerts (OCR accuracy <70%, AI inference latency >10 seconds, SQS message age >5 minutes) trigger immediate SNS notifications to the operations team. Warning-level alerts (GPU utilization >80%, cache hit rate <60%) are sent to monitoring dashboards.

We implemented custom CloudWatch metrics via the PutMetricData API to capture AI model performance, including real-time accuracy tracking by model type, user interaction patterns, and problem-solving success rates. These metrics give the development team MLOps dashboards showing model health, system performance, and user engagement, and they trigger automatic retraining when performance degradation is detected.

---

We deployed monitoring that tracks infrastructure health (EC2/ECS CPU and memory, RDS connections, CloudFront cache performance), application metrics (API response times, Lambda execution duration and errors, S3 upload success rates), and business KPIs (daily page views, active user sessions, content publishing rate). We configured CloudWatch Alarms with critical thresholds (error rate >1%, response time >3 seconds) that trigger SNS notifications to the operations team, and warning thresholds (CPU >75%, memory >80%) that generate Slack alerts. Custom metrics capture magazine-specific events, including article publish workflows and reader engagement, and are displayed on CloudWatch dashboards that the editorial and technical teams can access for real-time platform monitoring.
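
The tiered thresholds described for Happy Edu Tech can be expressed as data and turned into alarm definitions. The sketch below is illustrative only: it builds keyword arguments in the shape accepted by the CloudWatch `put_metric_alarm` and `put_metric_data` operations, using the thresholds stated above. The custom namespaces, metric names, queue name, evaluation settings, and SNS topic ARN are hypothetical placeholders, not the values used in the actual deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlarmSpec:
    name: str
    namespace: str
    metric: str
    threshold: float
    comparison: str           # a CloudWatch ComparisonOperator value
    severity: str             # "critical" or "warning"
    statistic: str = "Average"
    dimensions: tuple = ()    # pairs of (dimension name, dimension value)


# Thresholds come from the description above. Namespaces, metric names,
# and the queue name are hypothetical, except the AWS/SQS metric name.
SPECS = [
    AlarmSpec("ocr-accuracy-low", "HappyEduTech/AI", "OcrAccuracyPercent",
              70, "LessThanThreshold", "critical"),
    AlarmSpec("inference-latency-high", "HappyEduTech/AI",
              "InferenceLatencySeconds", 10, "GreaterThanThreshold", "critical"),
    AlarmSpec("sqs-message-age-high", "AWS/SQS", "ApproximateAgeOfOldestMessage",
              300, "GreaterThanThreshold", "critical", "Maximum",
              (("QueueName", "ocr-jobs"),)),
    AlarmSpec("gpu-utilization-high", "HappyEduTech/Infra",
              "GpuUtilizationPercent", 80, "GreaterThanThreshold", "warning"),
    AlarmSpec("cache-hit-rate-low", "HappyEduTech/App", "CacheHitRatePercent",
              60, "LessThanThreshold", "warning"),
]


def to_alarm_params(spec: AlarmSpec, critical_topic_arn: str) -> dict:
    """Build keyword arguments for the CloudWatch put_metric_alarm call."""
    params = {
        "AlarmName": f"{spec.severity}-{spec.name}",
        "Namespace": spec.namespace,
        "MetricName": spec.metric,
        "Statistic": spec.statistic,
        "Period": 60,                # assumed: one-minute datapoints
        "EvaluationPeriods": 3,      # assumed: three consecutive breaches
        "Threshold": float(spec.threshold),
        "ComparisonOperator": spec.comparison,
    }
    if spec.dimensions:
        params["Dimensions"] = [
            {"Name": name, "Value": value} for name, value in spec.dimensions
        ]
    # Only critical alarms notify the operations team through SNS;
    # warning alarms are surfaced on dashboards and carry no action.
    if spec.severity == "critical":
        params["AlarmActions"] = [critical_topic_arn]
    return params


def ocr_accuracy_datum(model_type: str, accuracy_percent: float) -> dict:
    """Build one MetricData entry for the CloudWatch put_metric_data call."""
    return {
        "MetricName": "OcrAccuracyPercent",
        "Dimensions": [{"Name": "ModelType", "Value": model_type}],
        "Value": accuracy_percent,
        "Unit": "Percent",
    }


# Hypothetical SNS topic ARN for the operations team.
CRITICAL_TOPIC = "arn:aws:sns:ap-northeast-2:123456789012:ops-critical"
ALARMS = [to_alarm_params(spec, CRITICAL_TOPIC) for spec in SPECS]
```

With boto3 installed and credentials configured, each entry could be applied with `boto3.client("cloudwatch").put_metric_alarm(**params)`, and a datum published with `put_metric_data(Namespace="HappyEduTech/AI", MetricData=[datum])`. Keeping the thresholds in a single list makes them easy to review against the KPI guidance document.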

Reference Documentation
Defining, Monitoring, and Analyzing Customer Workload Health KPIs: https://wishket-team.notion.site/customer-workload-health-kpis

Korean

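This project focused on CloudFront-based static content delivery and image optimization. We implemented CloudFront access log analysis for traffic patterns and cache performance. Because the architecture consisted primarily of AWS managed services (S3, CloudFront, Lambda), we did not build a comprehensive operational health KPI framework with real-time alerting. Our standard monitoring framework is documented in the Workload Health Monitoring Guide, and the Happy Edu Tech implementation shows it applied in practice.

---

This project focused on the modernization architecture and cloud-native design of a marketplace platform. The customer's internal technical team was responsible for operational implementation, including monitoring and observability configuration. Our scope was to provide architectural guidance and best-practice recommendations for CloudWatch monitoring setup. The customer's operations team managed the actual implementation, KPI definition, and operational dashboard development according to their internal standards and business metric requirements. This approach matched the customer's preference to retain operational control and integrate monitoring with their existing internal processes. Our monitoring capabilities and KPI frameworks are documented in the Workload Health Monitoring Guide, and the [Magazine Platform] customer implementation shows them applied in practice.

---

Our monitoring framework uses Amazon CloudWatch as the primary observability platform, collecting and analyzing workload health across the infrastructure, application, and business layers. We establish standardized KPIs through automated metric collection and configure CloudWatch Alarms for threshold-based and anomaly detection alerting. Structured application logs are centralized in CloudWatch Logs, which supports rapid root cause analysis through CloudWatch Logs Insights queries and custom dashboards for real-time operational visibility.

Implementation Example - Happy Edu Tech AI-Powered Math Learning Platform:

We deployed monitoring that covers infrastructure metrics (Lambda concurrent executions, SQS queue depth, GPU instance utilization in Auto Scaling Groups), application performance (OCR processing time, AI inference latency, ElastiCache cache hit rate, API Gateway request counts), and AI-specific business KPIs (OCR accuracy, model inference response time, problem recognition success rate).

We configured CloudWatch Alarms with tiered severity levels. Critical alerts (OCR accuracy <70%, AI inference latency >10 seconds, SQS message age >5 minutes) send immediate SNS notifications to the operations team. Warning-level alerts (GPU utilization >80%, cache hit rate <60%) are sent to monitoring dashboards.

We implemented custom CloudWatch metrics that capture AI model performance through the PutMetricData API. These include real-time accuracy tracking by model type, user interaction patterns, and problem-solving success rates. They give the development team MLOps dashboards showing model health, system performance, and user engagement metrics, and they trigger automatic retraining when performance degradation is detected.

---

We deployed monitoring that tracks infrastructure health (EC2/ECS CPU and memory, RDS connections, CloudFront cache performance), application metrics (API response time, Lambda execution duration and errors, S3 upload success rate), and business KPIs (daily page views, active user sessions, content publishing rate). We configured CloudWatch Alarms that send SNS notifications to the operations team at critical thresholds (error rate >1%, response time >3 seconds) and generate Slack alerts at warning thresholds (CPU >75%, memory >80%). We built custom metrics that capture magazine-specific events, including article publish workflows and reader engagement, and displayed them on CloudWatch dashboards that the editorial and technical teams can access for real-time platform monitoring.

Reference Documentation
Defining, Monitoring, and Analyzing Customer Workload Health KPIs: https://wishket-team.notion.site/customer-workload-health-kpis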

OPE-001-F

Resource Links and Files

Additional Notes from the Audit Applicant

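Seungbin: A guideline exists as a baseline. For Funnels, we continued to monitor workload health after delivery and at times sent warnings before the customer noticed an issue. We suggested that the customer use these metrics as KPIs (and provided them with documentation), but we did not verify whether the customer actually adopted them (SMB). See the Happy Edu Tech runbook below.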

OPE-001-K

Resource Links and Files

kotech-market-kpi-guidance.pdf

OPE-001-H

No materials have been registered.

OPE-001-B

No materials have been registered.