Q&A with Ophir Zahavi at H2O.ai: Running Secure, Automated, Multi-Cloud Infra for AI/ML at Scale

Emily Lehman

Aug 4, 2025

Ophir Zahavi and his team are building one of the most automation-heavy, multi-cloud, zero trust AI infra stacks we've seen.

H2O.ai, the world’s leading agentic AI company, converges Generative and Predictive AI to power solutions used by AT&T, PayPal, and half of the Fortune 500.

As H2O's Senior Manager of Cloud Engineering, Ophir leads a globally distributed platform team that runs infrastructure across major cloud providers and multiple specialized GPU providers. Behind the scenes, Ophir’s team is responsible for the platform that enables both customer deployments and internal research environments, all designed for speed, security, and scale.

What makes Ophir's work so compelling isn’t just scale. It’s how his team builds for it: VPNs are out. Context-aware, identity-driven access to services, clusters, and apps is the default. Every piece is automated, observable, and hardened, without slowing down experimentation.

We sat down to unpack how they’ve operationalized platform engineering in a multi-cloud, GPU-intensive world, what “zero trust” really means when your stack spans secure enclaves and ephemeral clusters, how they approach multi-cloud complexity without compromising on control, and what platform teams can learn from their approach.

Here’s our conversation with Ophir, lightly edited for brevity.

Q: Tell us about what H2O.ai does, and what your responsibilities are as Senior Manager of Cloud Engineering.

Ophir Zahavi: H2O.ai is all about democratizing AI. As the leading agentic AI company, H2O offers Generative and Predictive AI to help enterprises and public-sector agencies develop purpose-built GenAI applications on their private data. H2O's open source technology is trusted by over 20,000 organizations worldwide, including more than half of the Fortune 500, among them AT&T, Commonwealth Bank of Australia, Singtel, Chipotle, and Workday.

At H2O.ai, I lead the DevOps and Platform Engineering teams. We’re responsible for the Managed Cloud, H2O’s PaaS platform, as well as our internal cloud infrastructure, which spans multiple cloud providers. Our mission is to ensure the infrastructure is secure, scalable, and reliable, while also empowering our developers to focus on building products that advance H2O.ai’s mission to democratize AI.

Q: We often hear that security and developer velocity are in conflict, especially in fast-moving AI environments. How do you balance these priorities?

OZ: It’s a common perception that security slows things down, but in reality, when done right, security becomes an enabler. At H2O.ai, we treat security as part of the developer experience, not something bolted on at the end.

By investing in the right tooling, such as infrastructure as code, automated access controls, and solutions like Twingate, we’ve made security seamless. Developers don’t have to think about how to get access or whether something is compliant. It just works, securely and reliably.
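As a hedged illustration of what "access as code" can look like in practice, here is a minimal Terraform sketch using Twingate's provider. The network, group, and resource names below are hypothetical, not H2O.ai's actual configuration, and the exact shape of the access block varies across provider versions:

```hcl
terraform {
  required_providers {
    twingate = {
      source = "Twingate/twingate"
    }
  }
}

provider "twingate" {
  # The API token and network slug are typically read from the
  # TWINGATE_API_TOKEN and TWINGATE_NETWORK environment variables.
}

# Hypothetical remote network representing one cloud environment.
resource "twingate_remote_network" "dev_aws" {
  name = "dev-aws"
}

# Hypothetical group; in practice membership is usually synced from the IdP.
resource "twingate_group" "platform_engineers" {
  name = "Platform Engineers"
}

# Publish the dev Kubernetes API endpoint to that group only --
# identity-scoped access instead of a flat network route.
resource "twingate_resource" "dev_k8s_api" {
  name              = "dev-k8s-api"
  address           = "k8s.dev.internal" # hypothetical private address
  remote_network_id = twingate_remote_network.dev_aws.id

  # NOTE: the access block's schema differs between provider versions;
  # check the docs for the version you pin.
  access_group {
    group_id = twingate_group.platform_engineers.id
  }
}
```

Because a definition like this lives in Git, granting or revoking access becomes a reviewed pull request with history, rather than a ticket against a shared VPN config.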

The key is to embed security into the workflow so it feels like part of the platform, not a speed bump. That way, teams can move fast without cutting corners.

Q: How do you build and maintain a culture of security across non-security teams, particularly with rapid global hiring?

OZ: Security is everyone’s responsibility, not just the security team’s, and that mindset has to be built into the culture from day one. Personally, I focus on leading by example: we treat every infrastructure change as code, go through proper reviews, and constantly ask, “Is this the secure way to do it?”

At the broader org level, we’ve invested in onboarding, clear processes, and tooling that builds security into the workflow. Whether it’s access control, CI/CD, or device posture, we try to make the secure path the default path so it’s easy to do the right thing.

Rapid hiring, especially across time zones, makes consistency harder but also more important. That’s why we automate wherever possible, document everything, and build guardrails into our systems. It’s not about slowing teams down. It’s about giving them the confidence to move fast without compromising security.

Q: What advice would you give to other security and platform leaders in AI/ML organizations who are considering adopting a zero trust security model?

OZ: For anyone considering a zero trust approach, especially in fast-moving AI/ML environments, my biggest advice is: start with your pain points. Understand what’s not working today, whether it’s VPN sprawl, inconsistent access control, or lack of visibility. That clarity will help you prioritize what matters most in your rollout.

Also, think about the developer experience from day one. If access feels like a bottleneck or adds friction, teams will find workarounds, and that’s exactly what you don’t want. The secure path has to be the easy path.

Make sure your solution fits with infrastructure-as-code practices. If it doesn’t integrate with Git, Terraform, or your automation stack, you’ll struggle to scale it and maintain consistency.

And don’t underestimate global performance. What works great in one region might fall apart with latency or reliability issues in others. We learned that the hard way.

Q: Any specific lessons learned or "gotchas" you'd want others to avoid?

OZ: One “gotcha” was assuming that switching to zero trust would automatically simplify everything. The reality is, it takes some initial investment, especially in onboarding and change management. But once it’s in place, the long-term gains in security, visibility, and productivity are absolutely worth it.

Q: H2O.ai runs significant ML workloads on Kubernetes. Can you tell us about your current K8s setup and how you're managing access to these clusters?

OZ: Yes, H2O.ai runs a large portion of its AI/ML workloads on Kubernetes, across a multi-cluster, multi-cloud setup. We have clusters dedicated to customer environments, development, and testing across AWS, Azure, and GCP.

Managing access to Kubernetes in that kind of environment is definitely a challenge. We have to balance security with developer velocity, making sure access is secure, auditable, and least-privilege, without slowing teams down.

We’ve integrated access controls into our CI/CD pipelines and use tools like Twingate to handle secure connectivity to K8s APIs. Developers don’t get blanket access to clusters. Instead, they get scoped, time-bound access based on roles, and it’s all managed as code.
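To make that concrete, here is a minimal sketch of what role-scoped, least-privilege access can look like when it's managed as code. It uses the standard Terraform Kubernetes provider with hypothetical names, not H2O.ai's actual modules:

```hcl
# Sketch: a namespace-scoped, read-mostly role for ML engineers,
# bound to an IdP-backed group rather than individual users.
resource "kubernetes_role" "ml_dev" {
  metadata {
    name      = "ml-dev"
    namespace = "ml-experiments" # hypothetical namespace
  }

  rule {
    api_groups = ["", "apps", "batch"]
    resources  = ["pods", "pods/log", "deployments", "jobs"]
    verbs      = ["get", "list", "watch"]
  }

  rule {
    api_groups = ["batch"]
    resources  = ["jobs"]
    verbs      = ["create", "delete"] # enough to run experiments, nothing more
  }
}

resource "kubernetes_role_binding" "ml_dev" {
  metadata {
    name      = "ml-dev"
    namespace = "ml-experiments"
  }

  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "Role"
    name      = kubernetes_role.ml_dev.metadata[0].name
  }

  subject {
    kind      = "Group"
    name      = "ml-engineers" # group asserted by the OIDC provider
    api_group = "rbac.authorization.k8s.io"
  }
}
```

Because the binding is just code, time-bound access can be layered on top, for example by having a pipeline revert the binding after an approved window, with every grant visible in Git history for audit.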

Q: How do you see zero trust principles applying to Kubernetes environments? What would be your ideal approach?

OZ: Zero trust fits naturally with how we think about securing Kubernetes. The traditional, network-based approach just doesn’t scale in a multi-cluster setup. Instead, we focus on identity-based access, where every user and service is authenticated and authorized based on who they are, not where they’re coming from.

For us, the ideal approach starts with least-privilege access at the role level. Developers only get access to the resources they need, scoped by cluster and often with time-based restrictions.

We also see value in continuous verification, not just at login but throughout the session. Whether that’s tied to device posture, location, or role changes, the access should adapt dynamically.

And finally, service-to-service communication within the cluster should follow zero trust as well. That’s where service meshes come in, giving us mutual TLS, traffic policy enforcement, and observability between workloads.
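As one illustration of that layer (assuming Istio, which is only one mesh option and not necessarily what H2O.ai runs), strict mutual TLS can be enforced declaratively per namespace. Routing it through Terraform's `kubernetes_manifest` resource keeps it in the same as-code workflow:

```hcl
# Sketch: require mTLS for all workload-to-workload traffic in a namespace.
# Assumes Istio is installed; PeerAuthentication is an Istio CRD.
resource "kubernetes_manifest" "strict_mtls" {
  manifest = {
    apiVersion = "security.istio.io/v1beta1"
    kind       = "PeerAuthentication"
    metadata = {
      name      = "default"
      namespace = "ml-experiments" # hypothetical namespace
    }
    spec = {
      mtls = {
        mode = "STRICT" # reject any plaintext traffic between workloads
      }
    }
  }
}
```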

It’s all about moving from static trust to adaptive, context-aware access, which is exactly what Kubernetes needs at scale.

This setup has really helped us maintain consistency, improve onboarding, and support developer productivity, especially for teams working across different time zones and cloud environments.

Q: For organizations running AI/ML workloads on Kubernetes, what's your advice on implementing zero trust for container environments?

OZ: If you’re running AI/ML workloads on Kubernetes and thinking about zero trust, my advice is to start with identity and access management. That means treating both users and services as first-class identities. Use centralized auth, role-based access controls, and make sure everything is auditable.

Non-human access is often overlooked, but it’s just as critical. Service accounts, CI/CD pipelines, and automation tools should have tightly scoped, least-privilege access. These are often the easiest entry points if not properly managed.
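Here is a hedged sketch of that idea with hypothetical names: a dedicated service account for a deploy pipeline that can update exactly one Deployment in one namespace, and nothing else:

```hcl
# Sketch: a CI/CD identity that can only roll out one app in one namespace.
resource "kubernetes_service_account" "deployer" {
  metadata {
    name      = "ci-deployer"
    namespace = "model-serving" # hypothetical namespace
  }
}

resource "kubernetes_role" "deployer" {
  metadata {
    name      = "ci-deployer"
    namespace = "model-serving"
  }

  rule {
    api_groups     = ["apps"]
    resources      = ["deployments"]
    resource_names = ["scoring-api"] # only this workload
    verbs          = ["get", "patch"]
  }
}

resource "kubernetes_role_binding" "deployer" {
  metadata {
    name      = "ci-deployer"
    namespace = "model-serving"
  }

  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "Role"
    name      = kubernetes_role.deployer.metadata[0].name
  }

  subject {
    kind      = "ServiceAccount"
    name      = kubernetes_service_account.deployer.metadata[0].name
    namespace = "model-serving"
  }
}
```

Pairing this with short-lived, projected service account tokens instead of long-lived secrets keeps the blast radius small if a pipeline credential ever leaks.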

I’d also recommend looking into service mesh integration early. A mesh gives you secure service-to-service communication with features like mTLS, policy enforcement, and observability, all key to a zero trust model at the network layer.

And finally, plan for GitOps and CI/CD security from the start. Infrastructure and access should be defined in code, reviewed, and deployed through trusted pipelines. That’s what gives you both speed and control, and helps zero trust scale with your team and workloads.



A big thank you to Ophir for taking the time to sit down and talk with us! Check out our full case study with H2O.ai to learn more about how Ophir and H2O.ai use Twingate to power secure, infrastructure-as-code access for their global teams, so they can focus on accelerating AI development without compromising security or control.

New to Twingate? We offer a free plan so you can try it out yourself, or you can request a personalized demo from our team.

