- Minimum 12 years of SRE experience
As Senior VP / Director, Site Reliability Engineering Lead, you will be responsible for leading and building a team of software and system engineers, including recruitment, training of new talent, system operation, maintenance and coordination, and team culture building. You will develop a long-term technical plan with a clear implementation path and milestones, and continuously ensure the competitiveness of the team and its technology. You will design and implement software platforms and monitoring frameworks for efficient, automated and intelligent governance of event-driven / service-oriented architectures. As part of performance engineering exercises, you will also monitor, troubleshoot and analyse performance issues in applications and the underlying infrastructure, and derive gold-configuration parameters.
You are expected to set up the processes needed for efficient execution and to advocate good engineering practices. This includes formulating process specifications and plans with regard to access, configuration, disaster recovery and fault handling for all critical paths of the operating platform, and promoting the evolution of business architecture design through the reduction of customer anxiety.
You will work with the bank's infrastructure and software development teams to ensure service reliability and uptime appropriate to the needs of users, with fast iterations of improvement. This includes working with the system development team to ensure system reliability throughout the entire life cycle, from system design to launch (cradle to grave); with solution architects and the application development team to ensure adherence to best practices in design and coding with respect to SRE and CRE principles; and with other business teams to improve cross-team coordination and ensure continuous improvement and optimization of business flows. You will also assist the development team in tuning applications and configurations for critical systems to comply with the NFRs before going live in production, and ensure that performance recommendations are part of the change request process.
You will also identify opportunities for continuous improvement across the full lifecycle of a large distributed system (i.e. design, development, configuration, testing, deployment, monitoring and operations). You will continuously evolve automated operation and maintenance facilities and platforms, automating manual tasks in performance monitoring, alerting, analysis, reporting, capacity planning and similar areas to improve application observability, resiliency and operational efficiency. You will also ensure appropriate governance of framework usage across multiple delivery streams and enhance the framework's capabilities to meet upcoming requirements.
You will drive thorough performance analysis of microservices code using single-user code profiling techniques, and participate in and contribute to resiliency validation exercises, with proper reporting to stakeholders. You will also define critical performance KPIs, set alert rules and roll out monitoring dashboards for production, with timely reporting to stakeholders.
To qualify, individuals must possess:
- Practical experience in the maintenance of large-scale distributed system architectures, hybrid cloud / on-premise environments, and event-driven or event-stream systems (e.g. distributed storage, scheduling, big data computing systems) is preferred.
Must have:
- Good understanding of the maturity levels of the different stages of an SRE setup
- Experience in project and team management
- Systematic thinking in operations and maintenance, with the ability to balance tactical and strategic approaches
- Familiarity with Linux systems and networking
- Familiarity with Helm / Terraform
- Positive attitude towards continuous learning, a passion for software development, and close attention to optimizing existing systems, building infrastructure and reducing or eliminating toil through automation
Good to have:
- Hands-on experience in application monitoring with Grafana, Kibana, Prometheus, AppDynamics or Dynatrace
- Hands-on experience in Chaos Engineering
- R&D experience is a bonus
Please reach out to Vyon Ng at 69500385 or VyonN@charterhouse.com.sg for a confidential discussion.
Only successful candidates will be notified.
EA License no.: 16S8066 | Reg no.: R1110857