DevOps Engineering
Vision
Our goal is not merely to launch features, but to ensure they land successfully and provide real value to our customers. We strive to develop a best-in-class product that exceeds expectations across all user groups by meeting high quality standards while ensuring reliability, ease of operation, and scalability for diverse customer needs. All team members should remain mindful of our target customers and the multiple platforms we support in everything we do.
Ensure our product excels in all aspects, especially for our primary customer archetype: large enterprises. This includes scalability, adaptability, and seamless upgrade paths. When designing and implementing features, always keep in mind compatibility with all our deployment options: self-managed, Dedicated, and Software as a Service (SaaS).
Develop our technical, diverse and global team to drive results that support our product and customer growth, while maintaining our values and unique way of working.
Mission
GitLab’s unique way of working asynchronously, our handbook-first approach, our use of the product we develop, and our clear focus on our values enable very high productivity. We focus on constantly improving the quality, usability, and reliability of our product to maximize customer satisfaction. Community contributions and customer interactions rely on efficient and effective communication. We are a data-driven, customer-experience-first, open-core organization delivering one secure, reliable, world-leading DevSecOps platform. Join us in setting new standards, driving innovation, pushing the boundaries of DevSecOps, and consistently delivering exceptional results for our customers.
We make complex workflows simple and intuitive
We are the platform advantage – we deliver unmatched DevOps acceleration and efficiency, as when Siemens brought together 20,000 fragmented developers and grew that number to 40,000 through the application of our product. Our product scales seamlessly from startups with simple needs to enterprises with advanced CI/CD workflows and complex repository management. Our solutions reduce time-to-market, and our reliability enables teams to focus on innovation, not maintenance.
Key Focus Areas
- Having a rock-solid foundation
  - Moving from reactive (bug burndowns) to proactive (pushing scalability boundaries)
  - Raising the quality bar to deliver Enterprise-grade quality
  - Optimizing for golden journeys
- Competitive Displacement
  - Targeted competitive wins in high-value areas
  - Consolidated workflows and reduced operational complexity
  - Customer-first mindset
- Innovation and Creativity: Positioning GitLab as the premier platform for AI-Native software development
  - Contributing to company vision of agentic AI
  - Key differentiators
  - Platform intelligence
We aim to create an environment where high-performing teams can thrive, innovate and execute efficiently, ultimately driving GitLab’s competitive edge in the market.
Having a rock-solid foundation
Our approach to quality will evolve in 3 phases, ultimately targeting ease of use, intuitiveness, and usefulness. Top of mind will be depth and stability in support of our evolved customer base.
3 Phases to Quality
- Moving from reactive to proactive quality management
  - Stabilizing incident response
  - Normalizing error budget management
  - Clearing critical issue backlogs
- Raising the quality bar to meet customer expectations
  - Implementing higher quality standards (99.9% availability, i.e. at most 8.76 hours of downtime per year)
  - Delivering the improvements customers have been waiting for
  - Transitioning from “good enough” to “reliably good”
- Optimizing golden journeys and workflows
  - Identifying and perfecting key user paths
  - Creating seamless experiences in critical workflows
  - Organization-wide alignment around experience excellence
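The 99.9% figure above translates directly into a downtime budget. A minimal sketch of that arithmetic (assuming a 365-day year; the helper name is ours, not GitLab's):

```ruby
# Translate an availability target into an annual downtime budget.
# 99.9% availability leaves 0.1% of the year unavailable, i.e. 8.76 hours.
HOURS_PER_YEAR = 365 * 24 # 8760

def annual_downtime_hours(availability)
  (HOURS_PER_YEAR * (1 - availability)).round(2)
end

annual_downtime_hours(0.999)  # => 8.76
annual_downtime_hours(0.9999) # => 0.88
```

Each additional "nine" cuts the budget by a factor of ten, which is why raising the standard is a phased effort rather than a one-time fix.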
Organization
```plantuml
@startwbs
* DevOps Engineering
** Create
*** Code Review
*** Remote Development
*** Source Code
*** Import
** Ops
*** Deploy
*** Package
*** Verify:Runner
*** Verify:CI Platform
*** Verify:Pipelines
**** Pipeline Authoring
**** Pipeline Execution
** Plan
*** Product Planning
*** Project Management
*** Knowledge
@endwbs
```
The DevOps department comprises the sub-departments and groups shown above.
To better fulfill our goals and vision, we follow the R&D interlock process.
Onboarding
Welcome to GitLab! We are excited for you to join us. Here are some curated resources to get you started:
- Joining as an Engineer
- Joining as an Engineering Manager
- Core DevOps GitLab Project
- Core DevOps Google Calendar
Meetings
Meeting (Internal Only, limited access) | Frequency | DRI | Topics |
---|---|---|---|
SaaS Availability | Weekly | Infrastructure | Incident corrective actions, Feature Change Lock status, Past Due Infradev, Error Budget, or Security updates |
Core DevOps Meeting | Weekly | Michelle Gill | Standards discussions, questions, help needed, FYIs, process improvements, collaborating on engineering strategy, event planning, leadership communication dissemination, wider initiative brainstorming, OKRs |
Product Quality Standup | Weekly | Michelle Gill | Coordination on shared quality objectives (bug burndown, instrumentation) across groups |
Slack Channels
Performance Indicators
To better measure our goals and vision, we are tracking Performance Indicators here.
People Processes
How We Work
Incident Management
Team members in some job families contribute to incident management directly through an on-call schedule for Incident Managers. Team members should complete onboarding so they can be added to the schedule when needed. These frequently asked questions cover exemptions and changing shifts.
Availability
Error budgets are tracked weekly and monthly to understand the reliability of a service aligned to .com availability. When a given team’s error budget is “in the red” for a period of 2 weeks, it will be reported on in the FY26 Product Quality Standup until enough weeks have passed in the green to stabilize (minimally a full 28-day period).
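As a rough sketch of the arithmetic behind that red/green status, assuming a 99.9% target over the 28-day window mentioned above (the constants and method names are illustrative, not GitLab's actual tooling):

```ruby
# Error budget over a 28-day window at a 99.9% availability target.
SLO = 0.999
WINDOW_MINUTES = 28 * 24 * 60 # 40_320 minutes

def budget_minutes
  WINDOW_MINUTES * (1 - SLO) # ~40.3 minutes of allowed unavailability
end

# "Green" while unavailability stays within budget, "red" once it is spent.
def budget_status(unavailable_minutes)
  unavailable_minutes <= budget_minutes ? :green : :red
end

budget_status(10) # => :green
budget_status(60) # => :red
```

A team that burns roughly 40 minutes of user-visible unavailability in the window has spent its budget, which triggers the reporting described above.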
Cross-Functional Collaboration
Issues that impact code in another team’s product stage should be approached collaboratively with the relevant Product, UX, and Engineering managers prior to work commencing, and reviewed by the engineers responsible for that stage.
We do this to ensure that the team responsible for that area of the code base is aware of the impact of any changes being made and can influence architecture, maintainability, and approach in a way that meets their stage’s roadmap.
Architectural Collaboration
At times when cross-functional, or cross-departmental architectural collaboration is needed, the GitLab Architecture Evolution Workflow should be followed.
Follow the Sun Coverage
When cross-functional collaboration is required across global regions and time zones, it is recommended to adopt the Follow the Sun Coverage approach to ensure seamless global collaboration.
Decisions requiring approvals
At GitLab we value freedom and responsibility over rigidity. However, there are some technical decisions that will require approval before moving forward. Those scenarios are outlined in our required approvals section.
Security Vulnerability Handling
- The development groups who introduce or consume the dependency of concern (e.g. gems, libs, base images, etc.) are responsible for resolving vulnerabilities detected against the dependency.
- For business-selected vendors that provide base images (RHEL’s UBI8, for example), we need to wait for their patches or log a Deviation Request (DR) as the viable resolution. The VulnMapper, an automation developed by the Threat Management team, can create vendor dependency DRs in most cases, but some DRs still need to be reported manually.
- The assigned development group can redirect issues if the initial assignment was inaccurate, following the processes for shared responsibility issues and/or Shared responsibility functionality.
Ownership of Shared Services and Components
The GitLab application is built on top of many shared services and components, such as the PostgreSQL database, Redis, Sidekiq, and Prometheus. These services are tightly woven into each feature’s Rails code base. There is often a need to identify the DRI when demand arises, be it a feature request, incident escalation, technical debt, or a bug fix. Below is a guide to help people quickly locate the parties best placed to assist on the subject matter.
Ownership Models
There are a few models to choose from, maximizing flexibility so that each shared service and component can adopt what works best for it.
- Centralized with Specific Team
- A single group owns the backlog of a specific shared service including new feature requests, bug fixes, and technical debt. There may or may not be a counterpart Product Manager.
- The single group is a specific team, meaning there is an engineering manager and all domain owner individuals reside in this team. The DRI is the engineering manager.
- This single group is expected to collaborate closely and regularly in refining and planning the backlog.
- This model may require consensus from the Product Management counterpart.
- This model may fit a subject domain that experiences active development.
- Centralized with Virtual Team
- A single group owns the backlog of a specific shared service including new feature requests, bug fixes, and technical debt. There may or may not be a counterpart Product Manager.
- The single group is a virtual team, meaning it consists of engineers from various engineering teams, for example maintainers or subject matter experts. Typically there isn’t an engineering manager for this virtual team. The DRI is an appointed person in the group who may not necessarily be an engineering manager.
- This single group is expected to collaborate closely and regularly in refining and planning the backlog.
- This model may fit a subject domain that’s in maintenance mode.
- Collectives
- Collectives consist of individuals from existing teams who voluntarily rally around a shared interest or responsibility, but unlike Working Groups may exist in perpetuity. The shared interest could be a specific technology or system. Collective members feel a collective responsibility to weakly own, improve upon or otherwise steer the subject they govern.
- This is a weaker form of the Virtual Team but introduces more structure than a fully decentralized model. It can be appropriate when some form of ownership is desirable where the subject has cross-cutting impact and wide reach and cannot clearly be allocated to any specific team.
- Collectives do not have product or engineering managers, they are fully self-governed.
- Members of the Collective sync regularly and keep each other informed about the shared interest. Problem areas are identified and formalized in the Collective, but are not logged into a Collective backlog. Instead a DRI is assigned who should put the task forward to the team with the greatest need for the problem to be resolved. This is to ensure that work is distributed fairly and that there are no two backlogs that compete with each other for priorities.
- Collectives work best when they consist of a diverse set of individuals from different areas of product and engineering. They double as knowledge sharing hubs where information is exchanged from across teams in the Collective first, and then carried back by the individuals to their specific teams.
- Decentralized
- The team who implements specific functions or utilizes certain features of the shared services is responsible for their changes from local development environment to production deployment to continued maintenance post-deployment. There is not a development-wide single DRI who owns a portion or the entirety of a shared service.
- A specialty team may exist for specific subject domains; however, their role is to enable scalability, availability, and performance by building a solid foundation and great tools for testing and troubleshooting for other engineering teams. They are not responsible for gating every single change in the subject domain.
Shared Services and Components
The shared services and components below are extracted from the GitLab product documentation.
Service or Component | Ownership Model | DRI and Group (Centralized Only) | Additional Notes |
---|---|---|---|
Alertmanager | Centralized with Specific Team | @twk3 Distribution | Distribution team is responsible for packaging and upgrading versions. Functional issues can be directed to the vendor. |
Certmanager | Centralized with Specific Team | @twk3 Distribution | Distribution team is responsible for packaging and upgrading versions. Functional issues can be directed to the vendor. |
Consul | |||
Container Registry | Centralized with Specific Team | Package | |
Email - Inbound | |||
Email - Outbound | |||
Elasticsearch | Centralized with Specific Team | @changzhengliu Global Search | |
GitLab K8S Agent | Centralized with Specific Team | @nicholasklick Configure | |
GitLab Pages | Centralized with Specific Team | @vshushlin Knowledge | |
GitLab Rails | Decentralized | DRI for each controller is determined by the feature category specified in the class. app/controllers and ee/app/controllers | |
GitLab Shell | Centralized with Specific Team | @andr3 Create:Source Code | Reference |
HAproxy | Centralized with Specific Team | Infrastructure | |
Jaeger | Centralized with Specific Team | @dawsmith Infrastructure:Observability | Observability team made the initial implementation/deployment. |
LFS | Centralized with Specific Team | @andr3 Create:Source Code | |
Logrotate | Centralized with Specific Team | @plu8 Distribution | Distribution team is responsible for packaging and upgrading versions. Functional issues can be directed to the vendor. |
Mattermost | Centralized with Specific Team | @plu8 Distribution | Distribution team is responsible for packaging and upgrading versions. Functional issues can be directed to the vendor. |
MinIO | Decentralized | Some issues can be broken down into group-specific issues. Some issues may need more work identifying user or developer impact in order to find a DRI. | |
NGINX | Centralized with Specific Team | @plu8 Distribution | |
Object Storage | Centralized with Specific Team | @lmcandrew Scalability::Frameworks | |
Patroni General except Geo secondary clusters | Centralized with Specific Team | @plu8 Distribution | |
Patroni Geo secondary standby clusters | Centralized with Specific Team | @luciezhao Geo | |
PgBouncer | Centralized with Specific Team | @plu8 Distribution | |
PostgreSQL Framework and Tooling | Centralized with Specific Team | @alexives Database | Specific to the development portion of PostgreSQL, such as the fundamental architecture, testing utilities, and other productivity tooling |
PostgreSQL GitLab Product Features | Decentralized | Examples like feature specific schema changes and/or performance tuning, etc. | |
Prometheus | Decentralized | Each group maintains their own metrics. | |
Puma | Centralized with Specific Team | @pjphillips Cloud Connector | |
Redis | Decentralized | DRI is similar to Sidekiq which is determined by the feature category specified in the class. app/workers and ee/app/workers | |
Sentry | Decentralized | DRI is similar to GitLab Rails which is determined by the feature category specified in the class. app/controllers and ee/app/controllers | |
Sidekiq | Decentralized | DRI for each worker is determined by the feature category specified in the class. app/workers and ee/app/workers | |
Workhorse | Centralized with Specific Team | @andr3 Create:Source Code |
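Several decentralized rows above determine the DRI from the feature category declared on the class. The module, controller, and mapping below are a simplified, hypothetical sketch of that pattern; GitLab's real implementation lives in the Rails code base under app/controllers and app/workers.

```ruby
# Hypothetical sketch: a class-level declaration that tags a controller
# or worker with a feature category, which can then be mapped to a group.
module HasFeatureCategory
  def feature_category(name = nil)
    @feature_category = name unless name.nil?
    @feature_category
  end
end

# Example controller declaring its category, as GitLab classes do.
class RepositoriesController
  extend HasFeatureCategory
  feature_category :source_code_management
end

# Illustrative mapping from feature category to owning group.
CATEGORY_TO_GROUP = { source_code_management: "Create:Source Code" }.freeze

CATEGORY_TO_GROUP.fetch(RepositoriesController.feature_category)
# => "Create:Source Code"
```

Because the category lives on the class itself, an incident or bug in any controller or worker can be routed to a DRI group without a separate ownership registry.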
Supporting Customers
Working with Support
When DevOps collaborates with Support, we gain invaluable insight into how customers are using the product and the challenges they run into. A few tips to make the process efficient:
- Get access to Zendesk so you can view the questions and communication from customers.
- Always write answers in a way that they can be “cut-and-pasted” and sent to a customer.
- Reference documentation in your responses and make updates to GitLab documentation when needed.
- Refer to existing issues and epics to reiterate our transparency value and to invite participation from the customer.
- If you are unclear about the support-development collaboration process or workflow, please refer to the handbook page How to use GitLab.com to request help from the GitLab development team.
Customer Account Escalation coordination
If development is the DRI or actively participating in a Customer Account Escalation, consider the following:
- Be careful to not make commitments to customers without first talking to product management and development leaders to confirm the impact that commitment may have on other commitments.
- The customer will want to know when they can see the benefits of a change. They may not be familiar with GitLab practices for tracking and predicting due dates and milestones. They may also be unfamiliar with our workflows and associated labels, the predictability of code review timelines, the different timelines for releases to GitLab.com versus releases for self-managed customers, and our use of feature flags.
- Customers often don't rely on asynchronous communication at the level that GitLab does. Educate the customer on our practices and adapt to find a combined asynchronous and synchronous communication method and cadence that works for everyone.
- Encourage customers to collaborate with us in epics, issues, and merge requests of interest. Keep in mind that they may not have access to ones that are confidential and/or may not be comfortable or able to collaborate with us in this public forum.
- Consider utilizing Google documents to collaborate with the customer as a backup for collaboration via epics, issues, and merge requests.
- Consider utilizing a shared Slack channel to collaborate, adding the customers to our Slack via "one Slack channel access requests". [Example](https://gitlab.com/gitlab-com/team-member-epics/access-requests/-/issues/16192)
- In meetings, tell customers why we like to record them and ask if they are OK with doing so. Consider using [Chorus](/handbook/sales/field-operations/sales-operations/go-to-market/chorus/) for scheduling the recordings to address legal requirements for recording meetings with customers.
- In meetings, tell customers why we take notes before, during, and after the meeting, as it may not be natural for them to collaborate in this way.
- Make sure the appropriate priority label is applied to all issues being tracked by the customer.
- In the agenda for recurring meetings, track the items tracked by the customer in priority order at the top and review the status, next steps, customer DRI, and GitLab DRI for each. Discuss in the meeting periodically. Remind GitLab team members in Slack to update the status of items they are the DRI for before recurring meetings.
- Post a link to the meeting notes and recording in a Slack channel for the customer escalation, so those who did not attend know that the notes and recording are available for review.
- When there is an action item for someone in a meeting (whether they are present or not), tag them in an issue or MR (or in Slack) so they will see it.
Reducing the impact of far-reaching work
Because our teams work in separate groups within a single application, there is a high potential for our changes to impact other groups or the application as a whole. We have to be cautious not to inadvertently impact not only overall system quality but also availability, reliability, performance, and security.
An example would be a change to user authentication or login, which might impact seemingly unrelated services, such as project management or viewing an issue.
Far-reaching work is work that has wide-ranging, diffuse implications, and includes changes to areas which will:
- be utilized by a high percentage of users
- impact entire services
- touch multiple areas of the application
- potentially have legal, security, or compliance consequences
- potentially impact revenue
If your group, product area, feature, or merge request fits within one of the descriptions above, you must seek to understand your impact and how to reduce it. When releasing far-reaching work, use a rollout plan. You might additionally need to consider creating a one-off process for those types of changes, such as:
- Creating a rollout plan procedure
  - Consider how to reduce the risk in your rollout plan
  - Document how to monitor the rollout while in progress
  - Describe the metrics you will use to determine the success of the rollout
  - Account for different states of data during rollout, such as cached data or data that was in a previously valid state
- Requiring feature flag usage (example)
- Changing a recommended process to a required process for this change, such as a domain expert review
- Requesting manual testing of the work before approval
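One common way to satisfy the feature-flag requirement above is a deterministic percentage rollout, sketched below. The hashing scheme and method name are illustrative assumptions, not GitLab's actual Feature API.

```ruby
require 'zlib'

# Deterministic percentage rollout: hash the flag name plus actor id into
# a stable bucket in 0..99, and enable the flag for buckets below the
# rollout percentage. The same actor always gets the same answer for a
# given flag, so behavior is consistent as the percentage is raised.
def feature_enabled?(flag, actor_id, percentage)
  Zlib.crc32("#{flag}:#{actor_id}") % 100 < percentage
end

feature_enabled?(:new_diffs, 42, 0)   # => false (0% rollout)
feature_enabled?(:new_diffs, 42, 100) # => true  (fully rolled out)
```

Deterministic bucketing lets a rollout be paused or rolled back by changing one number, which is exactly the kind of reduced-risk lever a rollout plan calls for.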
Identified areas
Some areas that meet the definition above have already been identified, and teams may consider altered approaches when working in them:
Area | Reason | Special workflows (if any) |
---|---|---|
Database migrations, tooling, complex queries, metrics | impact to entire application. The database is a critical component where any severe degradation or outage leads to an S1 incident. | Documentation |
Sidekiq changes (adding or removing workers, renaming queues, changing arguments, changing profile of work required) | impact to multiple services. Sidekiq shards run groups of workers based on their profile of work, e.g. memory-bound. If a worker fails poorly, it has the potential to halt all work on that shard. | Documentation |
Redis changes | impact to multiple services. Redis instances are responsible for sets of data that are not grouped by feature category. If one set of data is misconfigured, that Redis instance may fail. | |
Package product areas | high percentage of traffic share | |
Gitaly product areas | high percentage of traffic share | |
Create: Source Code product areas | high percentage of traffic share. Special attention should be paid to Protected Branches, CODEOWNERS, MR Approvals, Git LFS, Workhorse and the git over SSH / gitlab-sshd interfaces. Please contact the EM (@sean_carroll) or PM (@tlinz) if you are unsure. | |
Pipeline Execution product areas | high percentage of traffic share | Documentation |
Authentication and Authorization product areas | touch multiple areas of the application | Documentation |
Compliance product areas | potentially have legal, security, or compliance consequences | Code Review Documentation |
Workspace product areas | touch multiple areas of the application | Documentation |
Specific fulfillment product areas | potentially impact revenue | |
Runtime language updates | impact to multiple services | Ruby Upgrade Guidelines |
Application framework updates | impact to multiple services | Rails Upgrade Guidelines |
Navigation | impact to entire application | Proposing a change that impacts navigation |
Ops Sub-department