MacStadium Blog


Virtualizing macOS at Scale for iOS Devops

VMworld Panel

MacStadium recently hosted a well-attended panel discussion on Virtualizing macOS at Scale for iOS DevOps with some of our top customers at VMworld in Las Vegas. Participants including Capital One, Box, and Travis CI highlighted how MacStadium’s VMware/Mac cloud helps them meet their Apple development and continuous integration (CI) needs. The panel focused on their best practices for decreasing build times and increasing efficiency, and also touched on what tools and methods they considered most helpful in achieving these goals.

There were quite a few highlights of the discussion. For one, all-flash SAN storage has made a huge difference for our customers in reducing the time it takes to create VMs. After converting to Pure Storage, Capital One saw the time to provision a VM from the template they generate within vSphere drop from 20 minutes to about 20 seconds. When you are dealing with enterprise-level CI and initiating 4,000 builds a day, that matters.

Capital One wasn’t the only organization that saw dramatic improvements, though. Travis CI, which provides developers a software-as-a-service solution for CI and continuous deployment testing, creates different images that represent whatever developers might be using (various versions of Swift and Xcode, etc.). Ultimately, Travis CI winds up creating 54,000 VMs a day. Having Mac infrastructure capable of supporting that available as a service has become so important that Travis CI now considers it part of its business model.

The move from VMware 6.0 to 6.5 has also paid huge dividends for MacStadium customers. Linked clones help VMs look and act more like containers, greatly speeding up both processing and provisioning. With 6.5 and Pure Storage, Box went from 20 to 30 minutes to clone one VM down to around 10 seconds. And VMware's support for Mac virtualization is only growing, with more support for new VMware features coming soon!

There are more observations and insights available in the presentation, so it’s definitely worth watching if you haven't seen it yet. You can watch video of the MacStadium Virtualizing macOS at Scale for iOS DevOps panel here:

Or, you can find a transcript of the proceedings available below:

Speakers

Greg McGraw, CEO of MacStadium

Ray Sennewald, Senior Software Engineer for Box

Alex Niderberg, Senior Manager and Lead Software Engineer at Capital One

Josh Kalderimis, Vice President of Product at Travis CI

Preston Lasebikan, Lead Systems Engineer for MacStadium

Greg: 00:00:00  

Hi, my name is Greg McGraw, and I'm CEO of MacStadium. I'll talk a little bit about MacStadium in just a minute, but this afternoon, we're going to talk about virtualizing macOS at scale for iOS DevOps. It's a fairly unique area for us because there aren't a whole lot of companies that host Mac computers. I've got a great panel joining me today, from enterprise-class companies and service providers, plus MacStadium's resident VMware expert on the end. With me right here is Ray Sennewald, senior software engineer for Box. Next to him is Alex Niderberg, senior manager and lead software engineer at Capital One; Josh Kalderimis, Vice President of Product at Travis CI; and Preston Lasebikan, the lead systems engineer for MacStadium. I'd like to let each of these folks tell you a little about themselves and about the environments that they're managing on a daily basis.

Ray:  

Yeah sure so I'll get started. So again, my name's Ray. I work at Box. I'm a senior software engineer, but I specialize in build and release for the applications engineering teams, so particularly macOS and the macOS team as well as the iOS team. So I help support their CI infrastructure, which includes MacStadium.

Alex:  

I'm Alex. I'm at Capital One. I work on a team that does tools for the mobile engineers across the US, the UK, and Canada. We're ultimately looking to provide tools built on MacStadium and AWS to power our mobile engineers to ship features to customers.

Josh:

My name's Josh. I'm one of the founders at Travis CI. I head up the product side of our development. At Travis, we provide developers a software-as-a-service solution for continuous integration and continuous deployment. So essentially, as developers, you want to test your software on an environment as similar as possible to your development machine, while we need to run untrusted code across Linux and Mac. So we do this across MacStadium and Google Compute for Linux.

Preston:

I'm Preston. I'm the lead systems engineer at MacStadium. My main focus is supporting customers and troubleshooting problems that can arise with VMware running on Apple hardware.

Greg:

Excellent, thank you gentlemen. I'll just talk a little bit about MacStadium, for those of you in the room who have not heard of us or don't really know what we do. We're the only provider of enterprise-class Mac hosting solutions as infrastructure as a service. Nobody else does it. Google Cloud doesn't do it. Azure doesn't do it, and Apple doesn't do it. So primarily, we've taken the form factor of the Mac mini and the Mac Pro, added some innovation to it, and basically given it a new life as a data center computing asset. So we've basically built the resource pool for Macs.

Greg:

Where that comes into play is really “why Macs?” I mean I get this question a lot when I'm at a conference or at a bar. They say, "You host Mac computers?" I say, "Yeah, we've got 20,000 of them," and principally because we respect the fact that if you really want to develop rich iOS applications, you really need to use Xcode. Xcode is part of macOS. macOS only runs on genuine Apple hardware, and that's why MacStadium exists. Other cloud providers, as I said, don't really do this, and what we also find is that managing these basically disparate Mac resources internally is hard. So we try to make that easier.

Greg:

So why VMware? A couple years ago, one of our customers who was basically running bare metal on MacStadium said, "Hey, can I run VMware on this?" We were already a VMware partner, a global partner, so we said, "Sure." And through that integration and through those experiences we've continued to build upon how to improve the performance of those Mac resource pools in an enterprise-class environment. I'll let these gentlemen talk about each one of those scenarios for their companies and how they sort of are using the best practices, the best tools, to decrease build time, increase efficiency and operational excellence. Actually, VMware today is really the only virtualization option for Mac. There are a couple of other small startups that really fall short when you're really trying to build something at scale.

Greg:

So right now, I'd like to turn some of the questions over to the panel and have each of you sort of give us your story. Please say a few words about the CI infrastructure that you have running on Macs in your environments.

Ray:

Yeah sure. So at Box, we've got MacStadium as our infrastructure provider for Macs. We use Jenkins as our CI platform and we use the vSphere plug-in for Jenkins, which allows us to do some cool things, specifically one-time-use builds with our VMs. So what we do is we actually use Terraform to build out all of our VMs, so infrastructure as code. With a little bit of scripting we add those as nodes to Jenkins, and then using this plug-in, we can have the VMs revert back to a snapshot after every build. So it gives us container-like functionality but for macOS, and that's something we wanted to have at Box. We wanted to have a clean environment for every build, for every developer, because otherwise you're just going to run into problems.
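As a concrete illustration of that revert-to-snapshot pattern, here is a self-contained toy sketch in Python. This is not Box's actual tooling; `EphemeralAgent`, its fields, and the job names are invented for illustration:

```python
import copy

class EphemeralAgent:
    """Toy model of the revert-to-snapshot pattern: every build starts
    from a pristine snapshot, so nothing leaks between builds."""

    def __init__(self, base_state):
        # The snapshot captures the clean, known-good machine state.
        self._snapshot = copy.deepcopy(base_state)
        self.state = copy.deepcopy(base_state)

    def run_build(self, job):
        # A build inevitably dirties the machine: caches, artifacts, logs.
        self.state.setdefault("artifacts", []).append(job)
        return f"built {job}"

    def revert_to_snapshot(self):
        # Analogous to the vSphere plug-in reverting the real VM post-build.
        self.state = copy.deepcopy(self._snapshot)

agent = EphemeralAgent({"xcode": "9.4"})
agent.run_build("ios-app-pr-123")
agent.revert_to_snapshot()  # the next build sees a clean environment again
```

The design point is the same one Ray makes: state accumulated during a build is guaranteed to disappear before the next one, which is the "container-like" property for macOS VMs.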

Alex:

So at Capital One, we use a mix of enterprise Jenkins and hardware that we run on MacStadium. We also use Terraform. We use Terraform plus Ansible to be able to provide VMs that live for about a week. What that ends up providing to a lot of our development teams is that they can have pull request patterns that run through GitHub. That gets executed onto slaves that are connected to Jenkins, and ultimately we provide a standardized environment where they can run their Mac builds or their iOS builds and also their Android builds. We usually run 3,000 or 4,000 builds a day, and we'll run a lot of those by providing these VMs that are refreshed within minutes.

Josh:

So at Travis we're using MacStadium as well, as you can probably guess. We run 84 Mac Pros across two different vSphere clusters for HA and failover, and across that we're doing 10 different images, because with CI, when you're testing open source, you're essentially testing against what developers might be using. So you want to test against different Xcode versions and different Swift versions, and we provide various versions of these. Then, across a day we're doing about 54,000 VM starts. So we're starting up a fresh VM, running it for about six minutes on average, and throwing it away, and some of our workload is for open source. So we can help the open source community test their software better for free, because no one really wants to pay for a Mac to sit in a basement when you're just working on some open source. So we provide a lot of infrastructure for the open source community while also having a commercial offering.
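For a sense of scale, the figures Josh quotes imply a fairly modest steady-state VM count per host. A back-of-the-envelope calculation (illustrative arithmetic only, not Travis CI's actual capacity model):

```python
# Back-of-the-envelope sizing from the figures quoted above:
# ~54,000 VM starts/day, ~6 minutes average lifetime, 84 Mac Pros.
STARTS_PER_DAY = 54_000
AVG_LIFETIME_MIN = 6
MINUTES_PER_DAY = 24 * 60
HOSTS = 84

vm_minutes_per_day = STARTS_PER_DAY * AVG_LIFETIME_MIN    # total VM runtime per day
avg_concurrent_vms = vm_minutes_per_day / MINUTES_PER_DAY # VMs running at any moment
avg_vms_per_host = avg_concurrent_vms / HOSTS

print(avg_concurrent_vms)          # 225.0 concurrent VMs on average
print(round(avg_vms_per_host, 1))  # ~2.7 VMs per Mac Pro on average
```

Peak concurrency is of course higher than the average, which is one reason spikes in boot time matter so much at this volume.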

Greg:

So obviously a lot of our customers tell us that the real value that they're driving out of the Mac infrastructure is being able to virtualize it. Tell me what entered into your decision making when you were running bare metal versus moving over to a virtualized environment.

Ray:

So I can start. This was actually a decision that was made before I came to Box, but I was very happy to know that that was the case when I started there. I've been there for about a year. At my previous organization we managed a set of Mac minis – and I say “we” but there was really no clear ownership of it. Developers didn't want to maintain it, but they needed it for testing. My team, which was the CI infrastructure team, wanted to maintain the software on there, but we didn't want to maintain the hosts, and IT didn't really want to maintain it because it was drastically different from everything else in their server room. So it was a real big pain point for us. So I'm glad that whatever happened in the decision making at Box, they came to this decision, because it makes my life a lot easier, but developers' as well.

Alex:

At Capital One, my team kind of came to a view where we realized we were investing a lot in being able to deploy and keep infrastructure running in different cloud providers. We were also investing a lot in mobile development. At the time we had the Macs that we were running, and it really wasn't a stable platform for those developers to be able to push changes through. So ultimately, we realized it wasn't sustainable to try and manage bare metal machines, and having something that could run through virtualization really empowered us to provide a consistent environment where developers could get their test results and make sure that all things were stable before it was shipped to customers.

Josh:

For us, bare metal has never been an option because with CI, when you're running other people's code, you may be running customer one and then the next build is customer two. You cannot have leftover artifacts. You can't have any of the file system changed or log files left over. So it's all about having clean-room, sandboxed environments, and VMs are very important for how we do this in a secure fashion. Even if you take away the whole container discussion that happens on the Linux side, that's not what CI is really aimed at either, because security is paramount for trust in a multi-tenant environment.

Greg:

Excellent. Tell us a little bit about some of the advantages or sort of how your clouds and your deployments are configured today using 6.5 or linked clones, or ephemeral builds, just sort of the different types of ways that your environment exists.

Ray: 00:10:40            

Yeah sure, so I kind of touched upon this a little bit already, but essentially we just use regular clones with VMware, on vSphere 6.5. It's been much faster. I believe this is also due to storage improvements, which I think we'll get into a little bit later in the conversation. But we use Terraform to provision these and we connect them to Jenkins like I'd already mentioned. So that's pretty much the meat and potatoes of how we do it.

Alex:        

We are also operating on VMware 6.5. We have a process where we take an existing VM and add new versions of Xcode and other tools that the developers need. A lot of our process is tied to fastlane. So we'll also keep our VM and basically enable the developers to select the version of Xcode that they need for their run and then execute that in this standardized environment.

Josh:    

So we use 6.5 as well. We're not using snapshots. We are using the new SAN environments, so in the multiple iterations that we've gone through with MacStadium as we've increased our load and expanded, we've been focusing on how to get our boot times down, and how to make sure that we've got fewer spikes, fewer peaks. We do use Terraform and Packer for a lot of our building of the VMs, but we're starting VMs fresh. So we take a little bit of a penalty for this because of the different images that we're also starting at the same time – we're not just starting one image. We don't go through a suspend/resume state or snapshot; we start everything from fresh, and we'll take about a 60-second penalty. I know there is some discussion coming up in a later question about some of the enhancements that are coming, but working closely with MacStadium we've also gone through some SAN improvements, which have decreased this from three minutes down to 60 seconds and brought us huge speed improvements over time.

Greg:

Preston, we've heard from a lot of our customers. It's about speed, and performance, and the build times, and those kind of things. What are some of the things that MacStadium's putting in place in terms of SAN, in terms of other networking topology to sort of enhance or to improve that performance?

Preston:     

So originally we were using certain SAN arrays like NetApp for instance.

Greg:

Spindle-based.

Preston:         

Yeah, spindle-based basically, and they're a little bit older, and of course the throughput that you can get out of something like that was much, much lower. It got the job done at the time, but there's lots of newer technology. So we finally got around to basically replacing everything with all-flash storage, and I'm sure any of these guys can tell you there was a massive increase in performance from switching to flash storage. Throughput, build times cut down by minutes – massive factors of time completely changed. We've gone through several network upgrades too, including moving off of older Cisco gear, going to a new topology off of the standard campus style and moving towards VXLAN. So it's mainly your standard Cisco 9K series using VXLAN. The throughput completely changed. Instead of having your entire backbone east-west with the capability of, let's say, 10 or 20 gigs, internally right now we're running at 160 gigs of throughput east-west. Just from that topology change.

Greg:      

So I know Preston you've looked at 6.7 and are basically rolling out on a limited basis right now. Are some of those performance improvements that the network is providing evident in 6.7?

Preston:     

So with 6.7, what I've seen is they've made several tweaks that will help developers running on MacStadium be able to create these builds much much faster. There've been changes to things such as the cloning process, which originally comes from VMware's VDI. That's the API that most of these developers are actually hitting. It's using the exact same process. So there are changes and tweaks to the ability to segment out how a clone works.

Preston:       

For instance, in 6.5 it used to be called VMFork, and now it's officially titled instant cloning. The idea is I have a base image, I make clones of that parent, and now I'm running off the child image. In 6.5 they have a very tight relationship, in which if you do anything to that parent image, you'll basically break the linkage to any of the child clones. In 6.7, the new capability they've given you is that you can now have things like HA and DRS, and you can make changes live. You no longer have to freeze that parent, because essentially all you have to do is build your original image, save it, power it off, and that's it. You don't touch that. Now you can make immediate changes, and I mean it's a marked difference. It definitely creates a lot of usability for people to be able to run things now.

Greg:      

Yeah. You mentioned all-flash SAN, and I know that at MacStadium we've standardized on Pure. They’re actually an exhibitor here this week. It seems to have all the right security, all the right performance, all the right aspects to really optimize environments. I'd like to hear from each of the panelists how that changeover from the older NetApp storage has affected your builds and your performance.

Ray:           

Yeah, so it was huge for us. Like I said, we have around four templates, and we end up provisioning 100 VMs from those. On the old NetApp – and this was running an older version of vSphere as well, I believe it was 6.0 – it would take us 20 to 30 minutes to clone one VM, and with 6.5 and the Pure storage, we were able to get that down to around 10 seconds on average. So we can re-provision our whole fleet of VMs in 10 minutes now, which is awesome for us because now we can accommodate changes a lot easier. We really don't want to maintain state on these different VMs. So if we wanted to upgrade Xcode, for example, I want to just blow away all of these VMs and re-provision them, and that's what Terraform's really good at doing. Snapshots are really fast for us as well.

Ray:       

So it really allowed us to take advantage of this model that we were working on, and it's huge for us. Previously we really just wouldn't accommodate Xcode upgrades unless we absolutely had to. So developers would be at my door begging me, "Hey man, can we get Xcode 9.2?" And I'm like, "That's going to take us a lot of work." So now it's a lot easier for us. So it's really nice.
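The arithmetic behind the storage numbers Ray quotes is worth spelling out. A rough illustration, assuming serial cloning and using the midpoint of the 20–30 minute figure (in practice clones run in parallel, which is how a full fleet can re-provision in around ten minutes):

```python
# Rough arithmetic for re-provisioning a ~100-VM fleet, per the figures above.
FLEET_SIZE = 100
OLD_CLONE_S = 25 * 60   # midpoint of 20-30 minutes/clone (NetApp, vSphere 6.0)
NEW_CLONE_S = 10        # ~10 seconds/clone (Pure all-flash, vSphere 6.5)

per_clone_speedup = OLD_CLONE_S / NEW_CLONE_S       # 150x faster per clone
old_serial_hours = FLEET_SIZE * OLD_CLONE_S / 3600  # ~41.7 hours, serially
new_serial_minutes = FLEET_SIZE * NEW_CLONE_S / 60  # ~16.7 minutes, serially
```

Going from roughly a work-week of serial cloning to under twenty minutes is what turns "blow away the fleet and re-provision" from a last resort into routine practice.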

Alex:            

Yeah, we have a similar process, where we try to stay a little bit closer, I think as you guys do, to the Xcode releases, even to try to get some of the betas out there for our teams to use. When we shifted to the all-flash array, we saw the time to actually go from the template we generate within vSphere to provisioning it drop from 20 minutes to about 20 seconds. So it was a huge improvement in terms of being able to have these things up and running, and being able to refresh them a lot faster.

Josh:            

So we went from NetApp, where I think we went from 10-minute spikes down to a one-minute average, and for us this was a massive improvement because when you're dealing with 55,000 jobs per day, those spikes matter a lot. Across the ten different images that we have as well, it was much more reliable with the caching with the SAN, and with how we did our different slicing across the clusters. It's just been ... I guess more important for us is the whole build-versus-buy question.

Josh:          

What I mean is, instead of running this internally, it's been more effective for us to work with a partner on how best to understand how to scale VMware, and how to scale that effectively across a fleet that is continuing to grow. Because as our customers grow, I don't want to spend time within our organization building infrastructure. I want to work with a partner on how best to build that infrastructure. So going from NetApp to the new SAN, and even before that, we wouldn't have been able to do that ourselves without working with an effective partner.

Ray:          

I'll actually layer onto that a little bit. Another thing that went into our decision process was that we didn't want to have to keep spare Mac hardware on hand. As we were trying to figure out how we wanted to solve this problem, it's nice to have a partner who ultimately has spare Macs and has 24x7 hands to go and fix stuff. So that allowed us as a team to focus a lot more on providing the software and the tools, being able to help on the fastlane side, and optimizing and dealing with some of our networking considerations, as opposed to dealing with the hardware level.

Greg: 00:20:18            

Also, one of the things, obviously since we started hosting Macs bare metal and then adding virtualization later on, it's feedback from customers, clients, and partners like this that has helped us figure out how we can make recommendations or tweak our system to really drive that performance. I mean, to be talking about a Mac Pro and IOPS in the same sentence is kind of an interesting dichotomy at MacStadium. Let's talk a little bit about automation. Obviously with CI, there's the whole breadth of being able to automate builds, automate tools.

Greg:           

And you mentioned a couple of other tools that you have used and integrated into your environments. How would you characterize the tool environment, the tool companies? I know there's a new one that crops up it seems like every month. But what are those values that they can deliver to you that you're looking for in your environments?

Ray:            

Yeah, so I mean, the key value that we're really looking for is the ability for us to just focus on having as much as possible checked into source code, so that we can identify where problems happen. When you're doing things manually or you're doing something where it's not checked into source code, it's hard to pinpoint where the problems are at. That was something that was integral for us, so that's where Terraform and Ansible can really help people out for provisioning their infrastructure. And so in other parts of our organization, we use AWS. So we also do Windows builds and we use AWS for that.

Ray:         

We do similar stuff using Ansible and Puppet actually to do the provisioning on that side. But this is something we didn't have before on the MacStadium side; before, we actually used to manage a lot of these VMs manually. So bringing in something like Terraform made it a lot easier for us, and made it so that if we do a configuration change, it's tied to the source code. So if all of a sudden developers start reporting problems, I can be like, okay, well, what changed on that day? Versus before, I'd ask my team and nobody really remembered. We could probably audit the logs on VMware’s side, but that doesn't seem like the right way to do this. I would say that's where the power of it comes from on our side. The other part would be being able to revert back to a snapshot as a piece of automation that we use the vSphere API for. That's really integral for us; it allows us to ensure that we get clean builds for every VM. To us, it’s just knowing what is available out there, and, like Greg said, there's a lot of tools that are changing all the time. I just try to see what other people are using. I'm not really trying to reinvent the wheel at my organization. I feel like there are a lot of organizations out there that have already solved this problem, so figuring out what's out there and reusing it is typically the way that I approach these kinds of problems.

Alex:      

Are a lot of folks working to support iOS teams internally? A good mix?

Alex:     

I think one of the things that we saw as an interesting challenge was just how to keep up and how to have something that could ultimately scale to their needs. The way our setup works today is, we still have some Fusion hosts that we ran previously. We create templates where we'll take...

Alex:       

We have a bakery where we'll store the last good image. We'll go add the new version of Xcode, we'll add new versions of SwiftLint and other tools that the devs are requiring. We'll then check that in, and ultimately that will run through some automated processes we have to send it to vCenter. Then, once it's in vCenter, we use Terraform to go ahead and deploy it. We use Ansible after that to set up some configuration on the VM and set up some of the network stuff, and then ultimately there are some Python scripts that will tell Jenkins where those live. Then Jenkins can connect to those as nodes that it can use for execution. In the middle there's a validation step, where we have a Jenkins pipeline that captures all these steps and enables you to say “I need to destroy the old VMs, I need to deploy the new VMs,” and during that process we run a series of checks to make sure that some of the paths are set up correctly and that we haven't accidentally removed from our VM something else that the developers actually need.

Alex:     

We're continuing to work to improve that overall set of testing, and we'll actually share the repository where those tests live with a lot of our internal customers and give them the opportunity to contribute tests they want to have run before the VMs are considered good to be in service. That's another thing that we've found very powerful as we've been evolving our automation journey: actually empowering the folks that depend on the tools to help define what we do to consider a VM ready for service. Finally, we run an iOS app on the Mac side and an Android app that do UI testing and unit tests, just to make sure that functionally things are also working as we'd expect.

Alex:   

That's the other thing: as we've been maturing, we've been adding a lot more checks on top to make sure what we're rolling out isn't gonna cause problems for folks.
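The validation gate Alex describes – run a series of checks against a freshly deployed VM before Jenkins is allowed to use it, with internal teams contributing checks – might be sketched like this. This is a hypothetical minimal example; the check names and the `vm_facts` fields are invented, not Capital One's actual pipeline:

```python
def vm_ready_for_service(vm_facts, checks):
    """Return (ok, failed_check_names) for a candidate build VM."""
    failures = [name for name, check in checks if not check(vm_facts)]
    return (not failures, failures)

# Internal teams could contribute entries to this list, as Alex describes.
CHECKS = [
    ("xcode_installed",  lambda f: bool(f.get("xcode_versions"))),
    ("paths_configured", lambda f: "/usr/local/bin" in f.get("path", "")),
    ("jenkins_visible",  lambda f: f.get("jenkins_node_online", False)),
]

# Facts would be gathered from the freshly deployed VM (e.g. by Ansible).
facts = {
    "xcode_versions": ["9.4", "10.0"],
    "path": "/usr/local/bin:/usr/bin:/bin",
    "jenkins_node_online": True,
}
ok, failed = vm_ready_for_service(facts, CHECKS)  # gate before going live
```

If any check fails, the pipeline can refuse to swap the new VMs in and leave the old fleet serving builds, which is the safety property the validation step buys.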

Josh:     

You know what I love about this is I'm also taking mental notes. So, there's a lot of similarities oddly enough, in fact, I shudder to think about what the world was like before Terraform. It feels so long ago and yet it was only a year and a half or two years ago. But, having Terraform, and Packer scripts, and Chef, and Puppet, Ansible, all of these. It's all about documentation really. How we manage our infrastructure is about using this tooling as a method of documentation with pull requests for the team, so we can see what changes are being made and then how are they applied and keeping logs and state. We use Terraform very extensively across our organization for building VMs across Linux, and for Mac, and also for rolling out changes to vSphere. We've also built CLIs to interact with MacStadium and vSphere to check in and out Mac Pros when we may need to do maintenance to something.

Josh:  

We use a lot of Golang for the CLIs as well. It's all about automation for us, so the less that we have to use VNC, or the HTML client, or the Flash client back in the day, the more that everything can be automated, the better it is for us.

Preston:

Internally, MacStadium is also using automation, mainly mixtures of PowerCLI, Python, and Ansible, to do testing before turning over these environments to customers. Traditionally, before working at MacStadium, I would have one vCenter with a ton of virtual machines in it that I'd have to manage. Now working at MacStadium, it's a little different from my personal case in that now I have hundreds of vCenters that I have to manage, or make sure that, okay, this customer's having this, and this customer's doing this, and this customer's doing this. Okay, we've turned over this environment, there's one little setting that's missing from it, and that causes builds to fail and you can't figure it out.

Preston:           

So utilizing automation, we're able to say, "Okay, this works. Every environment that we turn over is exactly the same." We're trying to do that to clean things up, because with one little thing missed, your logs might not catch it; you may not even realize that something has been turned off, or simple little things like vMotion not being enabled on that one host. You're trying to create builds and vCenter's just spitting out errors saying, "Okay, I can't build anything. I can't do that." So automation's been pretty huge in that case.

Greg:       

One of the things that we've heard from our customers and our prospects is that they don't want to manage that anymore. A lot of that's really true for the software that they're running as well. Josh, I ask you specifically, since Travis has a hosted CI platform as well as an on-prem version. I keep hearing Jenkins over here, and Jenkins does not have a hosted platform. What are some of the customers that you have, in particular, that will choose either the hosted version or the on-prem version? And do they migrate between the two?

Josh:        

Sure, we've got TravisCI.com and .org; there’s a longer story about why there are two. And then we've got Travis CI Enterprise, which is our on-prem solution, which we've got large customers like IBM running for their own security requirements and needs. One of our customers, Schibsted, a customer as well of MacStadium, uses the exact setup that we use for running Mac builds. And what we generally see is this difference ... It's a really funny topic for me, private clouds, because back in the day we talked about “the cloud” as being AWS. And then “the cloud” moved to being VPCs, this private cloud, and this is exactly how we see MacStadium. It's just a private cloud of Macs; it works in just the same way.

Josh:       

We've got APIs, and what we want to do is provide the same interface that we use for running our CI solution to our private customers, where the private cloud is really just a contract. And because we're adhering to this contract, we can pass the same contract over to our customers so they can plug it in to the vendor that we recommend. Schibsted is running Mac builds for their needs, and I believe we're working on a shared case study about how we use Mac builds with MacStadium and Travis CI. Does that answer it? I get a little bit rambly at times, I'm sorry.

Greg: 00:30:41            

That's quite all right. Let's shift gears a little bit to security. I'm sure for everyone in the room, and at the companies that you work in, security is really paramount and getting more and more important, with all the data protection laws, data privacy laws, GDPR. The same thing holds true in the dev environments, and where those dev environments live and are deployed. Can you tell me a little about some of the things that have impacted you on the security side, especially in your Mac environments?

Ray:           

I can't really think of specifics that have impacted us, aside from recently, when we were doing a renewal with you guys, our security team had done a re-audit of MacStadium, and all I know is that I'm very thankful for your team working with our security team to make sure all those checkboxes were checked – so many acronyms I can't remember them all. I heard you mention a couple there. It's not really my domain; all I know is that we were able to get through that, and we have to vet every single cloud vendor that we use, so it's the same thing that we do with AWS, with Azure, with Google Cloud. We use nearly every cloud provider at Box; different teams use them for different purposes. So for us, security is a huge concern. We're managing customer data; it's not something that we want to worry about, and we actually can't use cloud providers unless they maintain all of the security requirements that we have. I don't remember what they are off the top of my head, to be completely honest, but that's what I can say.

Alex:          

Similar from our side. Being on the engineering side of a financial services company, security's a massive concern. One of the things before we were really able to start working with the MacStadium platform was a very thorough evaluation, and there are continued checks to make sure that it's ultimately providing something that meets the company's standards. The other thing that's been nice is being in an environment that we ultimately control; we're in an isolated environment. We have the opportunity to connect that back to some of our development environments, to another trusted environment that we control. That's been another thing on the security front that's been helpful for us: having some connectivity from our Mac environment, where you're going to have iOS simulators running, to be able to talk to internal APIs and validate some of those tests.

Josh:  

There are kind of two levels to this question for me. The first is what virtual machines give us, which is the security that we require for running untrusted code in a multi-tenant situation, because we don't know what one person's code is going to do and whether it will try to interfere with someone else's code. So security is paramount in the isolation, not just CPU and memory but networking. The other side is that our code is all open source. We do this in such a way that we can share how we do it, and since we're all open source on GitHub, people can contribute. It's partially security by means of community involvement, but also security in the sense of how we utilize the platform and why we are utilizing a certain technology.

Greg:

Preston, on the infrastructure side of MacStadium, are there any of the Cisco enhancements, firewalls, those kind of things that are also complementing those higher security requirements today?

Preston:            

One major thing is there is no shared cloud at MacStadium. Unlike lots of other cloud providers, you're not getting a virtual machine or virtual private server that sits on the same host or shares anything. Right now, everybody from the firewall on down is dedicated to an individual customer. It also plays partially into the point of hitting the API for their calls to automate everything: you're not sharing the API, and that vCenter belongs strictly to each individual customer. Sometimes that may create a difference in manageability, because each person's doing something different, so the problems may come out differently, but it also isolates them. Is that a problem that hits Box but isn't going to hit Travis CI or Capital One? It's just you; it won't affect anybody else.

Greg:       

You bring up a good point. It is one of the questions I get often: "Why can't I spin up an Apple-virtualized environment and just buy by the drink like I can at AWS or Azure?" The reason is because Apple does not allow that. Every deployment in MacStadium is dedicated, whether it's a single mini or a hundred Mac Pros or a thousand minis. You've got root-level access to that, so you've got full control over what goes in and what goes out, and full visibility. Whereas we as a supplier do not have access to your data, do not have access to your guest OSs. That's the kind of abstraction that we've been (a) forced to build in, but (b) actually welcome building as well. It really separates the security responsibility between us.

Greg:

Looking ahead, we talked a little bit about 6.7, and also, Apple is a hardware company. As such, they keep coming out with hardware in new shapes and sizes. There are some rumors about a new Mac mini and Mac Pro, and the iMac Pro just came out recently. Obviously, Apple hasn't really focused on the enterprise client in quite a while; it's one of the things we try to do a stop-gap on. Are there any of those other technologies that Apple's come out with that you're looking at to implement?

Ray:            

Really not too much. The advantage for us is that we're running on VMs. We've got these Mac Pros that are hosted by MacStadium. We really want to ensure that we have a stable environment that we can test on, and if there's new hardware that's faster, that's something that we might look to. But this is the advantage, in my opinion, of running on VMs: you don't have to worry about trying out these different pieces of hardware. So from our point of view, we're just standardizing on the Mac Pros, and they work pretty well for us.

Alex:

I will echo that. I think a lot of the value we get in running within VMware is that we really have the opportunity to configure the environment a little differently on top of that same hardware. To your point, if there are some performance improvements, that may be a case to consider upgrading some of the underlying infrastructure, but largely we just want to make sure that we have enough capacity and enough speed to keep the developers happy.

Josh:     

There are two levels to this again, because for us on the CI side there is the question of, "What's the speed of the CPU?" That greatly affects how fast something runs; the newer CPUs just run faster. Then there's also: how many cores do you dedicate to a VM? How many gigs of RAM? And it all depends on what language you're running on that VM, because if you're running Ruby, for example, that's bound to one core. Node is bound to one core. But if you're using Swift and Xcode, like you should be if you're using Macs, those are multi-core. But are they multi-core during the build process? Are they multi-core during all the other bits that they're doing?

Josh:        

So giving more cores to a VM doesn't necessarily make the build faster. Sometimes you'll kind of learn where the peak is. In part, we want to optimize for developer happiness: things need to run fast, and we need to make it cost-effective. The newer CPUs are going to give us the quickest win. The iMac Pros that are coming out, that's really interesting; we haven't experimented with those yet. We're starting to do lots of other experiments with giving more cores and making sure that people are using them effectively, because then you get into compiler tricks and how you can actually speed that up. Then there's also caching. The land of CI just gets into this fun part of "there are actually a million little knobs to tweak," and then there's one big easy knob of faster CPUs.
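Josh's point that extra cores hit a peak can be reasoned about with Amdahl's law: if only part of a build parallelizes, the speedup flattens no matter how many cores you add. A minimal sketch; the 60% parallel fraction below is purely illustrative, not a figure from the panel.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Overall speedup when only `parallel_fraction` of the work parallelizes."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)

# With 60% of a build parallelizable, speedup can never exceed 1 / 0.4 = 2.5x:
for cores in (1, 2, 4, 8, 16):
    print(f"{cores:2d} cores -> {amdahl_speedup(0.6, cores):.2f}x")
```

Going from 8 to 16 cores here moves the speedup only from about 2.11x to 2.29x, which is one way to see why a faster CPU is often the "one big easy knob."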

Greg McGraw:       

It's not by accident, but that's how we rack the iMac Pro right there, in a data center. Because again it is a faster processor, it's a little bit more super-charged than the Mac Pro, and for certain applications it really has some value. First of all, I think all of you have answered the fundamental question. You also had a DIY environment for Mac to basically support your dev teams. Then you moved off to outsourcing that where you could and sort of improve it as you go along. I'm sure many of you attendees in the room have lived that same life; I’d love to just get your final thoughts in terms of your overall approach and overall environment, and maybe some helpful hints and tidbits for the audience.

Ray: 00:40:04            

My recommendation is, if you guys have questions, ask us when we have some time, because I'm sure we've had very similar questions, or perhaps we don't have the answers but other people in the audience might. Everybody getting out there and just asking the questions that you've got about those environments would be helpful, because I've found the macOS environment to be one of the environments where it's not the easiest to find answers on Stack Overflow. But my final thoughts go back to what Josh said: the tuning in VMware has been really, really helpful to us. We found that we have a lot of CI jobs that really don't need very many CPU cores.

Ray:  

They're running tests, they're running functional tests, and they're not doing a build. So using VMware, we can really leverage these Mac hosts to their full potential. We can run hundreds of VMs on 12 hosts, because they really only need one core to run the operating system and some functional tests. Then we can parallelize our functional tests and get them done really fast, and we can also have some VMs with eight cores so they can do our Xcode builds as fast as possible. This is something that's pretty powerful with VMware, I feel. It's something we wouldn't really have the capability of doing on bare metal hardware, and having an infrastructure provider like MacStadium is huge for us.
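The mix Ray describes, many one-core test VMs alongside a few eight-core build VMs, comes down to simple core accounting. A back-of-the-envelope sketch; the host size and counts below are hypothetical, and it deliberately ignores the CPU overcommit and RAM limits a real vSphere cluster would also factor in.

```python
def leftover_test_vms(hosts: int, cores_per_host: int,
                      build_vms: int, build_vm_cores: int = 8) -> int:
    """One-core test VMs that fit after reserving cores for the build VMs."""
    total_cores = hosts * cores_per_host
    remaining = total_cores - build_vms * build_vm_cores
    if remaining < 0:
        raise ValueError("not enough cores for the requested build VMs")
    return remaining  # each test VM needs a single core

# Hypothetical: 12 hosts with 12 cores each, plus 8 eight-core build VMs.
print(leftover_test_vms(12, 12, build_vms=8))
```

With CPU overcommit, common for mostly-idle functional-test VMs, the real number can be several times higher, which is how hundreds of VMs land on a dozen hosts.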

Ray:

Developers are getting more and more used to having cloud providers. I can't tell you the number of times somebody's come up to me and said, "Hey, how come we can't just use AWS and just deploy a Mac over there?" I'm like, "Oh man, if I could, I would." I'm being completely honest: I want the easiest tool for the job. So having something that resembles that as closely as possible is something MacStadium can provide to us by utilizing VMware. VMware has some nice automation, such as DRS, which allows you to treat it as close to a cloud as possible. I really don't want to have to worry about tuning it and doing things like that. I want that to get figured out by itself, and DRS has some of those capabilities. If you've seen the other talks, you've probably seen more information about that.

Alex:

Definitely look at your options. I think there are things like Travis out there that you may be able to use and that may meet your needs. If you find yourself requiring a more secure environment-

Josh:          

Then continue to use Travis.

Alex:       

I'd say invest in the automation and make sure you're able to provide a stable service. A lot of the back and forth we worked on with the development teams was getting them to understand the value of consistency, and then helping draw the bounds so they understood that when a test fails and the infrastructure's healthy, it means CI is doing its job. That does not mean call us and panic. "The system's broken" is what we would hear. It took a while to gain the confidence of those folks that the tools were working and very healthy. I think that's something where you want to have tests that show everything's healthy, so you can push it back on them to figure out why their tests are failing, or why somebody committed something that's ultimately breaking the pipeline.
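The discipline Alex describes, proving the infrastructure is healthy before anyone blames it, can be boiled down to a canary check: run a trivial known-good job on the same pool, then triage the red build based on whether it passed. A sketch of the idea only; the function and messages are hypothetical, not anyone's actual tooling.

```python
def triage_red_build(canary_passed: bool) -> str:
    """Classify a failing build: infrastructure problem or genuine test failure.

    `canary_passed` reports whether a trivial, known-good job just ran
    cleanly on the same VM pool as the failing build.
    """
    if not canary_passed:
        return "infrastructure: page the CI team"
    return "test failure: pipeline is healthy, check the commit"

print(triage_red_build(canary_passed=True))
```

When the canary is green, the red build points back at the code, which is exactly the "do not call us and panic" boundary the panel is drawing.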

Ray:

Completely true. I think a lot of us are very small teams, and when you're that small a team, that much back and forth just creates so many problems. I'm completely with you there. It took a while to build that trust, because our environment was so unstable for the longest time. A lot of times, I'd look at it and say, "A test failed. That is not my problem. This is CI doing its job." Once you can get it to be very stable, then you gain the developers' trust, and they get what they need out of it.

Josh:        

We were having a bit of a debrief in the speakers' room before coming here, and we were all talking about our artisanal, one-of-a-kind setups. We've all got little differences, but what we also realized is that there is a tremendous amount of similarity. We're all doing maybe something a little bit differently to achieve a different means, but it reminded me that we need to talk about it more. There's more that we as Travis want to share about how we use VMware, how we use MacStadium, how we essentially provision 55,000 VMs per day, why we do it a certain way, and how we build these VMs. We need to share more so we can actually bring improvements across this.

Josh:             

I think the biggest improvements coming to VMware, that we've seen incrementally throughout the years, have been the automation improvements built into the APIs. For us, we don't want to open up the HTML5 client and click around. We want this to be as automated as possible, so we can hook it into CI jobs and into CLIs that our developers can run when doing maintenance tasks. As you said, we've got an incredibly small team. Across the six million builds that we're doing per month, and those are builds, not jobs, we've only got two people on our Mac team. While we've got eight people on our infrastructure team, that's two people on Mac servicing our VMware cluster, working with MacStadium, and working with our developers on how to make that better.

Josh:           

A lot of this is us planning for how we can continue to grow, because Mac use is not shrinking. I'm betting 80% of the people here have got an iPhone. You're all here because you're interested in Mac in some shape or form. Is that an iPhone or ... yeah, there we go. At least we've got one person here with an iPhone. I think that's me kind of sidetracking and waffling a bit.

Greg:

Preston?

Preston:         

I'd say one thing, to bridge off that point: MacStadium wants to facilitate a place where you not only use our infrastructure but where everybody can collaborate. It goes to that same point. Everybody has the same chain of things they have to do and accomplish, and they're not stepping on each other's toes; we're all trying to get something done. Like anything else, the community is a huge portion of that. Sharing back how things work is what we want to be able to do, because one small company may have figured out how to do something that multi-billion-dollar companies might not have seen. That can be the difference in everything running perfectly well.

Greg:         

We're not competing on best practices. Your companies may be competing on your core products, and that's fine, but if we can share those best practices, everybody wins. One final question, since you brought it up: AWS. As you know, you can't run macOS on a Windows machine, but you can run Windows and Linux on a Mac machine. A couple of our customers who have spun up environments, and are working them well through their CI process for Mac builds, are contemplating or have already brought over some of their Android builds. I'm just curious, and this is not a test: have you considered that as well?

Ray:          

Yeah. My team is very, very small. We can barely support the desktop applications team and, technically, the mobile team, but it's really just the iOS side. The Android side today doesn't actually utilize AWS themselves. Like I was mentioning, Box is very different depending on what team you're on. They use the core productivity engineering team's infrastructure, which is actually in-house infrastructure, which is ... I forget what it is, to be completely honest. I don't really deal with the Android side of things. I do know that we have some blade servers hosted by you guys. We run a DHCP server there, and we run some Windows machines that we need to keep long-running.

Ray:          

With AWS, one of the good things is that you can spin up a machine, throw it away, and pay only for its usage. If you have something longer-lived, that's not really the best fit, so it wouldn't necessarily be my recommendation to run that on AWS. That would be something we're actually looking to bring over to the blade servers: stuff that we do continuous testing on, for our staging environment, for functional testing, for example. We could run it on AWS, but it's a little bit more expensive.

Greg:          

Got it.

Alex:        

I can talk to that a little bit. We run Android emulation on HP blades that live in our MacStadium environment. That's really because on the Android side, you have the opportunity in AWS to run a lot of your build. That's where we'll run Android builds. We have an Android build pool that we maintain in AWS. We have an Android emulation pool that we maintain on a combination of the Mac and kind of more commoditized hardware. Then we have our iOS build pool that we'll maintain. That's really the three things that are giving developers on the iOS and Android side the opportunity to run unit tests and run UI tests, as well as all their other build needs.

Josh:        

Can you repeat the question just once more so I can make sure I don't sidetrack too much?

Greg:         

Just any Android, any non-Mac builds, moving over to the Mac infrastructure?

Josh:           

Right now, we use a mixture of AWS and Google for our Linux, and we use MacStadium for our Mac and iOS. We're not looking at moving our hosted Linux over. What we are interested in is how we can use VMware and Macs with on-prem clients. When they're using Travis CI Enterprise, they get Mac, Linux, and Windows all within one configuration and setup. It's less complicated licensing-wise, and also less complex than having to have, "Here's your Linux and here's your Mac." If we can give people the simplicity of having an entire CI setup on VMware, we see that as a very interesting product.

Greg: 00:50:01

Excellent. Well, thank you very much for sharing your comments, insights, and experiences. We've got about 10 minutes left. I'd love to open up the floor for questions. I'm sure you've got some questions you'd like to ask the panel.

Speaker 1:

Actually, I have a question, maybe for the MacStadium guys. How do you deal with managing the infrastructure? Do you have out-of-band stuff? In some of the pictures, it looked like you have external PCI chassis, maybe for 10 gig cards, but I didn't see anything for management. How do you deal with that?

Preston:            

As far as managing ESXi and things like that?

Speaker 1:           

I'm talking about if you have a drive fail, you have a PSOD or something like that. Is there anything that you can do so you don't have to actually be in the physical colocation space?

Preston:

Depending on what kind of hardware's actually sitting there, you can do remote boots. That's basically what we give the customer access to. Even through the dashboard, they have access to the power controls. With Apple hardware, you can set things to restart on power-on; you can set that pre-installation.

Speaker 1:            

But with an HP, for example -

Greg:          

Knock on wood, our rate of failure on the Mac hardware is really, really low. In most of our data centers, we have full-time staff, especially with the Mac minis.

Speaker 1:          

With an HP, you can remote into an ILO? Do customers just not have visibility if for some reason it's not responding to ping or they can't spin up a VM or something like that? Then just reach out to you guys and then you guys reboot it or something like that?

Preston:

Essentially, that would work, because nobody's running a single host or two hosts; everybody has multiple hosts. You have internal alerting that'll say, "This host is not responding." We're actually working on some deeper alerting, because the one problem we have, it being Apple hardware, and we've even heard this from VMware, is that Apple hardware is a black box. They have new features in 6.5, like Proactive HA, for instance. That can work on HP, on Dell, on Cisco UCS. It can't work on Apple, because nobody has any idea what's internally happening on it. No clue what's going on there.

Preston:

Essentially, we rely on alerting, and on VMware saying, "This host is not responding," or the management network not responding, and then we proactively remove that host from the cluster and put a new one in its place. On the side, we can worry about what caused that specific one to fail. Everything just continues on: HA does its job, and VMs and builds keep going. It's transparent to their customers, or to whatever they're doing internally; they can keep running while that happens in the background.

Greg:            

In some cases, we have a non-Mac, HP blade cluster for management of the Mac environment, with both utilizing the same storage for images and so forth.

Speaker 1:          

Also, I heard you guys talk a little bit about the builds or the OSs that you make available to your customers or the end users. How do you generate those? Are there any specific tools used to build the OS and get Xcode on it, or does someone do it by hand? Is it automation?

Ray:            

For our desktop team, it's actually mostly done by hand. For the iOS team, we use Ansible for most of the provisioning. We'll take the base ISO file, and then there are some things that we'll do manually; I don't really recall, I don't work on the iOS templates very much. For example, if something's going to take a 60-line AppleScript, we might just do that one time manually, save it as an image, never mess with it again, and then build with Ansible on top of that. That's what we do.

Speaker 1:           

The last question: do you guys use APFS at all? I know that it's not necessary with Xcode. Have you dealt with it at all?

Preston:           

Yes. We got APFS working in 6.7. With 6.5, you had to create essentially a little hack to ignore APFS to get macOS 10.13 running.

Speaker 1:

Basically, it told it to install in HFS Plus instead then?

Preston:          

Yeah. Now with 6.7, APFS is natively supported. Essentially, the only thing you're doing, via Terminal before you boot up that virtual machine with the ISO attached, is configuring and formatting the storage. It runs perfectly normally.

Speaker 1:

Do you have a lot of requests for it, or is it very uncommon?

Preston:

What'd you say?

Alex:

Do you have a lot of requests for APFS or is it pretty uncommon?

Preston:

Now, yes, given that 10.14 is coming out very soon. We get lots of requests: "Hey, when is 6.7 supported?"

Speaker 1:

Do you have any concerns about a mass upgrade once, what is it, Mojave comes out? That there's going to be a lot of need to do that?

Preston:

When they added Update Manager directly with 6.5, that was a big help. It no longer puts the burden on us to say, "You have to shut down everything; we've got to rebuild everything for a customer," or to run separate update managers. It's part of vCenter, so you can do it. What we're doing now is empowering the customer to have access to their vCenter; that's the direction we're going. You run your update and ta-da, there it is. You basically just do the updates.

Preston:

It's been a little tough sometimes with the multiple changes in ESXi coming out. With VMware, if you're running on Dell or any other standard hardware, everything seems to work perfectly fine. Then you try it on Apple hardware and things are clearly broken. We ran into that with 6.5, the GA release: a lot of customers wanted to move to it, and with the internal M.2 drives, basically a driver that was present in 5.5 and 6.0 was removed in 6.5. We were wondering, "What's happened?"

Speaker 1:

I think originally they didn't list the Mac Pro on the HCL and then they added it after…?

Preston:

Then they added it, yes.

Greg:  

As a VMware partner, too, we very frequently provide that real-life feedback to VMware for their development folks. We send feedback back to Apple, too, 'cause at some point it'd be nice to get them in the same room, but that'll never happen.

Speaker 1:

Thanks for answering the questions.

Greg:

Thank you.

Preston:

Thank you.

Greg:

Yes, sir?

Speaker 2:

I have a couple questions, too, along the same line as the infrastructure questions. How do you handle firmware updates to the Mac hardware that only get released in macOS upgrade cycles? And on the networking side, how do you get 10 gig to the backend SAN with the limitations of the Mac hardware?

Greg:

Preston?

Preston:

Essentially, we use hardware expansion boxes from Sonnet that connect to the Mac Pros. That enables us to utilize 10 gig and Fibre Channel. That throughput has been a saving grace, because otherwise you have your standard one gig copper. Everybody here pretty much knows that Apple is not enterprise; none of this stuff is enterprise. You basically have to use a little elbow grease to get these things to work in that sense. As far as firmware goes, we have done some updates, depending on when that hardware was actually released. We've noticed from the VMware side that there wasn't a need to constantly keep up with the changes, because it didn't really make too much difference; it mattered more on the bare metal side than the ESXi side. I've heard from VMware that they do a lot of guesswork getting ESXi to run on Mac, simply because they really just don't know. A memory pointer, for instance: "What does this do?" "I don't know. Let's just see what it does." "Oh, look. It works." "All right. We're good. We're set."

Greg:

These are some of the things that we've spent our time and energy working on. We've got six patents around how we rack and stack Mac infrastructure. The Mac mini has a single power supply; in a data center, you've got A and B power, so we create C. We deliver A and B power to every single Mac device, even a Mac mini. This is our patented sled: it has a PCI Express Sonnet box with some additional connectors, dual 10 gig cards, as well as fiber to the SAN. We've taken the Mac Pro in particular, added the right connectivity and the right structure, and modularized it to make it work in a data center. Then it becomes a very scalable Mac resource, just like an HP or Dell blade or whoever else is in the room here. Yes, sir?

Speaker 3:

Hi. I work for a college in California. Our whole infrastructure at the college is VDI; we don't have true computers on anything. One of the questions I have is ... we have a programming class that's interested in Xcode. For us to spend $20,000 on Mac computers is just not reasonable. Has anybody thought about virtualizing the Xcode program and getting it onto Horizon 7? Is that even possible? I'm willing to work with Mac.

Greg:

I'll take the answer from the infrastructure side. We have a university in London that uses our VDI software component (in fact, we just acquired the company that makes it) with 75 Mac minis in a resource pool that they can expose through VDI as basically a virtual Mac instance to the students. Again, it's that one-to-one relationship, but instead of buying 70 minis for that session, each mini is allocated to a virtual instance of the mini using the VDI software. There are ways to really do that, as opposed to hacking into all the Xcode and that kind of stuff. That's more the infrastructure side.

Speaker 3:  

Thank you.

Greg: 01:00:28

Absolutely. Yes, sir? We've got time for probably one more question and we've saved it for you.

Speaker 4:

More of a selfish question. You're obviously running a SAN system. Is there anything to stop you connecting a smaller number of Macs up and running vSAN across them instead? Have you tried that? Can you run vSAN across a cluster of Mac minis instead?

Preston:

We've done it in testing, and done it on Xserve as well; Xserve has been deprecated for quite a while, and we were trying to figure out, "What do we do with all this hardware?" We've run vSAN on there, but the performance has really been somewhat sketchy, mainly because the hardware isn't on the HCL. There are ways to get the Mac Pro running on there as well. Addo, I believe the company is, has external hard drives: you keep the M.2 drive internally for cache and the storage on the external. It works, but we've seen better performance from just keeping standard Fibre Channel or IP-based storage versus vSAN. I'm not necessarily sure if that's down to the Apple hardware itself or something else in the ether; we don't know why it's not as quick.

Speaker 4:

Could you use the external PCI boxes with a PCIe drive in there, and have your 10 gig connection as well?

Preston:

Yeah. Essentially, you still meet the requirements, and since the Apple Mac Pro is on the HCL, it technically is supported.

Speaker 4:  

That'll do it for me.

Greg:  

Thank you very much. Again, we're here for any questions afterwards. I encourage you to please fill out your survey; this is useful data that we use afterwards. You can also enter a drawing for a VMware company store card. Thank you very much for your time, and thanks again for coming.

