Software Engineer - Performance Tools
Company: Etched
Location: San Jose
Posted on: April 1, 2026
Job Description:
About Etched
Etched is building the world’s first AI inference
system purpose-built for transformers - delivering over 10x higher
performance and dramatically lower cost and latency than a B200.
With Etched ASICs, you can build products that would be impossible
with GPUs, like real-time video generation models and extremely
deep & parallel chain-of-thought reasoning agents. Backed by
hundreds of millions from top-tier investors and staffed by leading
engineers, Etched is redefining the infrastructure layer for the
fastest growing industry in history.

Job Summary
Join our team as a
Software Engineer - Performance Tools and take the lead in
illuminating the performance landscape of our cutting-edge ML
accelerator. We are seeking a highly skilled engineer to design and
develop a sophisticated performance analysis tool, tailored
specifically for Sohu. You will be instrumental in creating the
essential tooling that enables our ML engineers and customers to
understand workload behavior, identify performance bottlenecks, and
unlock the full potential of Sohu, accelerating the most demanding
ML applications in the world. This is a unique opportunity to shape
performance analysis for novel hardware from the ground up.

Key responsibilities
Lead the design and architecture of a comprehensive performance analysis suite, including data collection mechanisms, data processing pipelines, analysis engines, and user interfaces (CLI and/or GUI).
Develop robust methods to capture performance data directly from our custom ML accelerator hardware (e.g., hardware performance counters, execution unit status, memory access patterns) via driver interfaces or other mechanisms.
Implement tracing for host-side API calls (runtime libraries, driver interactions) and system-level events (CPU activity, PCIe traffic, memory usage, network contention) related to Sohu workloads.
Design and implement techniques to accurately correlate performance events across the host CPU, device driver, PCIe bus, multiple accelerators, and multiple hosts, ensuring precise time synchronization.
Build analysis modules to automatically interpret collected trace and counter data, identifying key performance limiters (e.g., compute-bound, memory bandwidth-bound, latency-bound, PCIe-bound, specific hardware bottlenecks).
Develop intuitive visualizations (timelines, dependency graphs, resource utilization charts, statistical summaries) to clearly communicate performance characteristics and bottlenecks to users.
Work closely with hardware architects, firmware engineers, driver developers, compiler engineers, and ML application engineers to understand their needs, define tool requirements, and provide expert guidance on performance analysis and optimization using the tool.

Representative projects
Architect and implement the core data collection framework for hardware performance counters on a custom PCIe-based accelerator.
Develop a kernel driver module or user-space service for low-overhead tracing of accelerator activity.
Design and build a correlated timeline view visualizing CPU API calls, driver submissions, PCIe transfers, and accelerator execution units.
Create an analysis pass to detect and quantify memory access inefficiencies or PCIe bandwidth saturation while transacting on a PCIe-attached accelerator.

You may be a good fit if you have
Strong proficiency in C++ or Rust.
Proficiency in Python is a plus.
Deep understanding of computer architecture (CPU, GPU, accelerators), memory hierarchies (caches, DRAM), and interconnects (especially PCIe).
Proven experience in low-level performance analysis, profiling, and bottleneck identification on complex hardware systems (GPUs, CPUs, FPGAs, or custom ASICs).
Experience with performance analysis tools (e.g., NVIDIA Nsight, AMD uProf, Intel VTune, perf, Tracy, ETW).
Experience working close to hardware, potentially reading performance counters or interacting directly with device drivers.

Strong candidates may also have experience with (nice-to-have qualifications)
Direct experience developing performance analysis or debugging tools.
Experience with ML accelerator architectures (GPUs, TPUs, etc.).
Experience with kernel-mode driver development (Linux or Windows).
Understanding of compiler internals, code generation, and optimization.
In-depth knowledge of the PCIe protocol and analysis tools (PCIe analyzers).
Experience with multi-chip or multi-host accelerator systems (e.g., TPU pods or NVIDIA DGX clusters).
Experience with firmware or embedded systems development.
Experience with hardware description languages (Verilog, VHDL) or hardware verification.

Benefits
Medical, dental, and vision packages with generous premium coverage
$500 per month credit for waiving medical benefits
Housing subsidy of $2k per month for those living within walking distance of the office
Relocation support for those moving to San Jose (Santana Row)
Various wellness benefits covering fitness, mental health, and more
Daily lunch and dinner in our office

How we’re different
Etched believes in the Bitter Lesson. We think most of the progress in
the AI field has come from using more FLOPs to train and run
models, and the best way to get more FLOPs is to build
model-specific hardware. Larger and larger training runs encourage
companies to consolidate around fewer model architectures, which
creates a market for single-model ASICs. We are a fully in-person
team in San Jose and Taipei, and greatly value engineering skills.
We do not have boundaries between engineering and research, and we
expect all of our technical staff to contribute to both as
needed.