Job Details

Job Information

Machine Learning Infrastructure Engineer - SIML, ISE
AWM-1456-Machine Learning Infrastructure Engineer - SIML, ISE
11/24/2025
11/29/2025
Negotiable
Permanent

Other Information

www.apple.com
Cupertino, CA, 95015, USA
Cupertino
California
United States
95015

Job Description


Role Number: 200633545-0836

Summary

Are you passionate about Generative AI? Are you interested in working on groundbreaking generative modeling technologies to enrich the lives of billions of people? We are the Intelligence System Experience (ISE) team within Apple’s software organization. The team operates at the intersection of multimodal machine learning and system experiences. Our multidisciplinary ML teams focus on a broad spectrum of areas, including Visual Generative Foundation Models, Multimodal Understanding, Visual Understanding of People, Text, Handwriting, and Scenes, Personalization, Knowledge Extraction, Conversation Analysis, Behavioral Modeling for Proactive Suggestions, and Privacy-Preserving Learning. These innovations form the foundation of the seamless, intelligent experiences our users enjoy every day.

We are seeking an ML Infrastructure Engineer to design, optimize, and scale the systems that power large-scale model training across the organization. This role sits at the intersection of high-performance computing, machine learning, and infrastructure engineering, delivering the core capabilities teams rely on to iterate quickly and reliably.

Description

The ideal candidate brings strong software engineering fundamentals, deep familiarity with distributed training, and a passion for building infrastructure that is efficient, observable, and easy for ML practitioners to use. You’ll work closely with model developers and platform teams to ensure training workflows are fast, reliable, and cost-effective—while also supporting users operationally to keep them unblocked and productive.

As an ML Training Infrastructure Engineer, you will:
* Build and maintain distributed training infrastructure.
* Optimize training performance through profiling, parallelization strategies, and hardware-aware tuning.
* Develop reliable pipelines for data loading, checkpointing, logging, and monitoring to support high-throughput training jobs.
* Collaborate directly with ML engineers to understand scaling bottlenecks and design solutions that improve both training speed and resource efficiency.
* Create and maintain tooling that simplifies how users configure, launch, and debug distributed training jobs.
* Implement strong observability across training workflows—telemetry, dashboards, alerts, and diagnostics.
* Support training workloads, investigate failures, triage performance regressions, and gather real feedback from users.

Minimum Qualifications

  • Bachelor's or Master's degree in Computer Science or a related technical field, or equivalent practical experience.

  • 3+ years of experience in software development, with strong Python skills and familiarity with systems programming concepts.

  • Hands-on experience with ML training frameworks (e.g., PyTorch Distributed, DeepSpeed, JAX, TensorFlow).

  • Knowledge of distributed systems, parallel computing, and accelerator architectures (GPU/TPU).

  • Experience debugging performance and reliability issues in complex, large-scale systems.

Preferred Qualifications

  • Strong communication skills and the ability to collaborate with ML practitioners and infra teams.

Apple is an equal opportunity employer that is committed to inclusion and diversity. We seek to promote equal opportunity for all applicants without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, Veteran status, or other legally protected characteristics. Learn more about your EEO rights as an applicant (https://www.eeoc.gov/sites/default/files/2023-06/22-088_EEOC_KnowYourRights6.12ScreenRdr.pdf).

Other Details


About Organization