The role: Site Reliability Engineer
Location: SFO, CA
Duration: Full time/Contract
Site Reliability Engineers are responsible for the pulse of the software ecosystem. We monitor and improve the system and suggest improvements for implementation by others. The name of the game is automating our job, because hiring linearly with our traffic growth is unsustainable. We are involved in incident and change management. We also act as consultants for engineers when new code and services are getting ready to go live.
Detective: SREs handle problems in live production systems, both on their own and in collaboration with systems and application engineers.
Ambassador: Keep the company informed about the status of services, the impact of known issues, and the progress of ongoing investigations.
Developer: Design and refactor parts of services and backend systems for stability, reliability, and performance, and write scripts to automate maintenance, monitoring, and other repetitive tasks (aka toil).
Coach: Meet with other teams and attend architecture reviews, and offer advice on how to implement features that are efficient, highly available, and fault-tolerant.
What do we look for?
We want people who:
Write code in Python and perhaps Java, and not just for classes.
Dig into the details of how a system, library, or tool works instead of just blindly using it.
Are willing and eager to wear many hats, as illustrated by the roles described above.
Dive into things that "aren't their problem."
Are willing to teach and lead others.
You have 8+ years of total IT industry experience, including a couple of years in a Lead/Manager role
You have 3+ years of experience as a systems/operations engineer or system administrator
You have 3+ years of Java/J2EE experience as a developer, preferably in the e-commerce domain
You are comfortable with the Python programming language and ecosystem
You are very comfortable using and administering Linux servers
You can work independently with limited supervision
You can communicate effectively with peers and tailor your communication to your audience
You have a willingness to dive in and assist co-workers when incidents arise
You're willing to participate in the team's production on-call rotation
Experience working with high-traffic, scalable web applications and services
Experience building, deploying, and operating your own web service
Knowledge of the administration and/or performance tuning of MySQL or Cassandra
Prior experience being part of an on-call rotation and responding to production incidents
Experience with cloud computing platforms like AWS or Google Cloud Platform
Familiarity with configuration management tools like Puppet, Chef or Ansible (we use Puppet and Ansible)
Experience developing and shepherding processes around change and incident management
Some familiarity with Java and its ecosystem
Experience with one or more of the technologies in our stack (or similar technologies):
Frameworks: Hibernate, Spring
Logging and Monitoring: Splunk, Dynatrace, Nagios, Logstash, Kibana (ELK)