Staff Site Reliability Engineer
Company: VirtualVocations
Location: Eugene, Oregon
Posted on: April 1, 2025
Job Description:
A company is looking for a Staff Site Reliability Engineer
focused on Machine Learning Infrastructure.
Key Responsibilities
Design and implement robust ML infrastructure for training, deployment, monitoring, and scaling of machine learning models
Improve reliability, availability, and scalability of ML infrastructure while ensuring efficient workflows
Collaborate with various teams to identify infrastructure requirements and streamline the ML lifecycle
Required Qualifications
7+ years of experience in Site Reliability Engineering, DevOps, or infrastructure engineering roles
Expertise with on-premises infrastructure for machine learning workloads (e.g., Kubernetes, Docker)
Proficiency with infrastructure automation and configuration management tools (e.g., Terraform, Ansible)
Experience with observability, monitoring, and logging for ML systems (e.g., Prometheus, Grafana)
Familiarity with popular Python-based ML frameworks (e.g., PyTorch, TensorFlow)
Keywords: VirtualVocations, Staff Site Reliability Engineer, Professions, Eugene, Oregon