Who am I? Need for Speed (and Scalability, Availability, Reliability, Resilience, ...)

[Image: Buckaroo Banzai's car]

The fastest car? One that can travel between dimensions!

I've been interested in non-functional quality attributes (also known as "ilities", QoS, etc.) for a long time. What do I know about performance and scalability? Architecture, design, programming and trade-offs for performance and scalability. The impact of technologies, configuration, JVMs, etc. Performance testing (including load and stress testing). Performance and DevOps. The impact of distributed systems on performance. Qualitative and quantitative tools and methods for performance and scalability analysis and evaluation. Application Performance Monitoring (APM), e.g. Dynatrace (including all tiers and user experience via browsers/mobile apps over WAN). Performance modelling and predictive analytics. It's REALLY not sufficient to know how your system is performing "now"; you need to be able to predict the impact of changes and how well it will work tomorrow! Can you imagine watching the weather on TV and only being told that it was warm, calm and dry and 26 today, but sorry, we can't give you a forecast for tomorrow (oh, there was a cyclone coming, whoops, sorry!)?

Counting rain drops as fast as possible

My 1st programming job, at the end of my 1st year of university (1980), was programming data loggers for a company in Wellington, NZ. These were both "space" and "time" displaced devices. You dropped them off in the bush for a few months and came back to retrieve them (and the data). If a local farmer hadn't decided to use them for target practice (shotgun holes were occasionally observed), you took them back to the office, extracted the data and analysed it. Sort of very high latency "networks". I programmed one whose purpose was to fly through clouds (on a plane) and measure rain drop density and frequency as quickly as possible, maybe an early form of "Cloud computing"?!


Multi-core microprocessor


Speed isn't the only thing; concurrency can be more important. For example, how do you build a digital music synthesiser in the 1980s to support multiple concurrent "voices"? Eight of them, say?

During my Masters degree, a friend (Associate Professor Mark Utting) and I decided that, because the university computer science course hadn't provided any hardware courses, we would fill in the gap ourselves with some extra-curricular R&D by building our own microprocessor computer. We designed it around a 6809E CPU (one of the 1st microprocessors with a proper memory architecture, and a whopping 64KB of RAM) and wire-wrapped and debugged it ourselves (we fried one memory controller worth $100 along the way). It had dual 8" floppy disk drives. We borrowed the physics department's oscilloscope for h/w debugging, which somehow got stolen from the locked computer science lab we were using (luckily they left our computer behind). I recall having to front the Vice Chancellor in his massive penthouse office to try and explain what had happened, whoops. Once we got the h/w working we designed, bootstrapped and wrote a complete O/S (in BCPL) including disk drivers, utilities, a file system, an editor, a Prolog interpreter and a music synthesiser, etc. Because we were interested in harnessing increased processing power for applications such as machine learning and music synthesis, we designed it as a shared-memory multi-processor architecture (around 2KB of fast RAM shared between pairs of CPU nodes).

Photos below of this computer (a bit dusty as it's been stored in the garage for years). It was relatively expensive to build at the time, as some of the bigger chips cost $100+. I think we spent maybe $2,000+ and 100+ hours designing, building and programming it. It had 51 ICs; the port on the lower left was to connect to another board via the shared memory. It consumed 10 A at 5 volts, had an 8 MHz clock speed, and had a maze of wire wrap on the back!







Another photo showing the Digital to Analogue Converter (DAC) chip (2nd in from the potentiometer) and the CPU (next to the reset switch). The other big chips were the 64KB RAM and the dynamic memory controller.



Machine Learning in (simulated) real-time

At Waikato University I'd become fixated on machine learning, specifically autonomous undirected learning. So for my MSc research/thesis (which eventually stretched over a couple of years and was worth 1/2 the course weight) I decided to "solve" it. The idea was to have a (simulated) child robot learner in a (simulated) robot-arm block-stacking world (simulated arm, simulated naive physics). I digested every paper and book on the subjects of machine learning, logic programming, philosophy of science and cognitive child psychology. The proposal was eventually to write and experiment with a program for "paradigm-computer learning". Starting from Kuhn's paradigm shifts, I wanted to explore what would happen to a learner who is primarily directed in their actions and thoughts by their current paradigm of how they believe the world works, with the ability to develop a theory (concepts, causal laws) consistent with that paradigm, or to reject the paradigm, throw out the current theory, and start over again with a new paradigm.

Machine learning is really all about performance. At one level it's just search, but the search space is enormous. You therefore need algorithmic and heuristic tricks to cut down the search space. Even then it took 3 days of VAX 11/780 time to run through a couple of paradigm shifts and a few dozen actions.




PhD Machine Learning Research at UNSW

At UNSW I wrote a very fast implementation of the Rete algorithm (an AI tuple/rule pattern-matching rules engine). My approach was based on CRC codes, and it was the fastest in the world at the time.
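The original code is long gone, but here's a minimal sketch of the general idea: use a CRC over a tuple's field values as a cheap hash key, so that matching a fully-ground condition becomes a single lookup. The tuple representation and class names are hypothetical, not the original UNSW implementation (which also had to handle partially-bound patterns and join nodes):

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.CRC32;

// Sketch: index working-memory tuples by a CRC of their field values so that a
// fully-ground rule condition can be matched with one hash lookup instead of a
// scan. A real Rete network also handles partially-bound patterns, join nodes
// and (rare) CRC collisions.
public class CrcTupleIndex {

    private final Map<Long, String[]> index = new HashMap<>();

    // CRC32 over the concatenated fields gives a cheap, well-distributed key.
    static long key(String... fields) {
        CRC32 crc = new CRC32();
        for (String field : fields) {
            crc.update(field.getBytes(StandardCharsets.UTF_8));
            crc.update(0);   // field separator so ("ab","c") != ("a","bc")
        }
        return crc.getValue();
    }

    public void assertTuple(String... fields) {
        index.put(key(fields), fields);
    }

    public boolean matches(String... pattern) {
        return index.containsKey(key(pattern));   // O(1) rather than scanning every tuple
    }

    public static void main(String[] args) {
        CrcTupleIndex workingMemory = new CrcTupleIndex();
        workingMemory.assertTuple("on", "blockA", "blockB");
        System.out.println(workingMemory.matches("on", "blockA", "blockB"));  // true
        System.out.println(workingMemory.matches("on", "blockB", "blockA"));  // false
    }
}
```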


PhD study at UNSW (1986-199), developing new machine learning algorithms for autonomous learning programs in temporal first-order domains. I developed several promising heuristic incremental algorithms for learning first-order Horn clause logic. I prototyped new algorithms in Prolog and then, for speed, often rewrote them in C to be as fast as possible (running on a RISC Pyramid computer).




UNIX Systems programming

I worked for 5 years with a startup in Sydney (Softway) and then the ABC (TR&D) doing UNIX systems programming. A lot of this was performance focussed; you can't fiddle around with the UNIX kernel unless you keep performance in mind. I also did s/w engineering of distributed systems with real-time performance requirements, including an Oracle mirrored distributed database system and Optus voice mail protocol/integration. At the ABC I designed and implemented an efficient multi-media file system cache for the D-Cart system.

Middleware Performance Engineering (J2EE)

In 1999 I moved from the CSIRO software engineering initiative (invited due to my recent extensive Java experience) to a research-oriented role with the Division of Maths and Info Sciences (later the ICT Centre), working for Dr Ian Gorton on a new Advanced Distributed Software Architectures and Technologies project (later renamed Software Architecture and Component Technologies). I was involved in the Middleware Technology Evaluation project, which conducted rigorous testing (benchmarking) of COTS technologies in the enterprise space to understand the trade-offs in the use of "standards"-based middleware technologies (e.g. J2EE), where different vendors may have implemented the standards in different ways and with different performance and scalability characteristics. I.e. if we wrote a single benchmark application according to the standard (possibly with architectural variants), would it (a) run, and if so, (b) how fast and (c) how scalably, on each vendor's implementation?

Looking at architectural variations was a key aspect of this project, as the standards often suggested or allowed more than one pattern of use for the various component types; which would work better, why, and what were the trade-offs? We started with a benchmark designed for CORBA (written in C) that emulated an online stock-broking system. I was involved in porting this to Enterprise Java (J2EE as it was then known) and designing how the architectural alternatives would work with a single code base. The main alternatives were different Entity Bean (EJB) persistence models (e.g. Container Managed Persistence, CMP, and Bean Managed Persistence, BMP), but also the use of Stateful Session Beans, and the number of servers in the AppServer cluster (1 or 2). I was also involved in setting up the testbed (h/w, database, test drivers, etc.), and deploying the benchmark onto multiple vendor products (or trying to, in some cases without success). I was the product expert for deploying and debugging the benchmark on the Borland J2EE AppServer, SilverStream J2EE Server, and Sun J2EE AppServer (with no success).
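For readers who never met EJB 2.x, here's a heavily abbreviated sketch of the CMP vs. BMP contrast at the heart of those variants (hypothetical bean, table and JNDI names; deployment descriptors, home/remote interfaces and exception handling omitted). With CMP the container generates the persistence code from abstract accessors; with BMP the bean hand-codes its own JDBC in ejbLoad()/ejbStore():

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import javax.ejb.EntityBean;
import javax.ejb.EntityContext;
import javax.naming.InitialContext;
import javax.sql.DataSource;

// CMP variant: persistence is declarative; the container generates the SQL
// from the abstract accessors plus the deployment descriptor (not shown).
public abstract class StockItemCMPBean implements EntityBean {
    public abstract String getSymbol();
    public abstract void setSymbol(String symbol);
    public abstract double getPrice();
    public abstract void setPrice(double price);

    public void ejbLoad() {}   // the container does the work
    public void ejbStore() {}
    public void ejbActivate() {}
    public void ejbPassivate() {}
    public void ejbRemove() {}
    public void setEntityContext(EntityContext ctx) {}
    public void unsetEntityContext() {}
}

// BMP variant (simplified, other callbacks omitted): the bean hand-codes its
// own JDBC, here with hypothetical table and JNDI names.
class StockItemBMPBean {
    private String symbol;
    private double price;

    public void ejbLoad() throws Exception {
        DataSource ds = (DataSource) new InitialContext()
                .lookup("java:comp/env/jdbc/StockDB");
        try (Connection con = ds.getConnection();
             PreparedStatement ps = con.prepareStatement(
                     "SELECT price FROM stock_item WHERE symbol = ?")) {
            ps.setString(1, symbol);
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) price = rs.getDouble(1);
            }
        }
    }
}
```

Which of these variants performed and scaled better on each product was exactly the kind of question the benchmark was designed to answer.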

I also had some experience with the ECPerf benchmark (leading to involvement in the SPEC Java committee). I was involved in running the benchmarks and variants on multiple products (as the h/w was in the Canberra lab, all the vendors' products had to be installed, the benchmark deployed and debugged, and then run multiple times for each architectural variant, with results collected and analysed). I was also involved in setting up and configuring the database drivers and the JVMs, which turned out to be more significant than first realised. A lot of effort ultimately went into setting up and tuning the JVMs and JDBC drivers, as we found that the JVM vendor product, type of JVM, garbage collection settings, number of containers, JVMs and CPU cores, etc. had significant impacts on performance and scalability. It was also time consuming to do, and sometimes we broke the JVM. I found, reported, and worked with Sun to fix a severe scalability flaw in a new version of their JVM related to thread management (from memory).

We migrated the benchmark through several changes in the J2EE standard (e.g. EJB versions), and I planned and prototyped enhancements including the use of JMS and Web Services in the benchmark.

Because I had become an expert on the J2EE standards and performance engineering during these experiments, and had some exposure to ECPerf, I was invited to represent CSIRO on the Standard Performance Evaluation Corporation (SPEC) Java committee. For several years I was involved in the development of the SPECjAppServer benchmarks, reviewing member submissions, and negotiations about new benchmarks (e.g. enterprise SOA).

During this time I also conducted, published and presented research around J2EE performance and scalability at international conferences and in journals (e.g. caching, JVMs, architecture, etc.), presented at industry and professional conferences and training events, and edited the 2nd edition of our detailed report and analysis of the J2EE products (published by CSIRO and Cutter).

I also conducted work and wrote reports for several consultancies (e.g. for Fujitsu and Borland) around performance, scalability and compliance with the J2EE standards. For example, Fujitsu had interpreted the standard strangely and required every different EJB to be deployed in a separate container, making deployment a nightmare. Was this compliant or not? It did run fast!

I also developed a research proposal to conduct a performance and scalability evaluation of J2EE products with "novel" architectures in conjunction with INRIA and ObjectWeb (their Fractal J2EE server used novel internal architectural mechanisms). This joint proposal for travel funds from the Australian Academy of Science and French Embassy Fellowship scheme was successful, but I was unable to take it up due to changes in CSIRO project structures.

Other research on these platforms also included code instrumentation and JVM profiling to determine how long was spent in each sub-system, in order to better understand the performance and scalability characteristics. I also discovered that, with sufficient information, it is possible to approximately model and predict performance and scalability under different loads; this also provided unique insights into potential bottlenecks, the potential speed-up if they were reduced, and why some of the vendor products had better performance or scalability than others. This was a very early precursor to my work with service-oriented performance modelling at NICTA from 2007 onwards. I also supervised several students during this work, including an evaluation of ebXML middleware, and an experimental analysis of how J2EE application object lifetimes interact with JVM garbage collector strategies and settings, and hence with performance and scalability.
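To illustrate the kind of prediction this enables (a generic textbook-style sketch, not the actual CSIRO model): with measured per-tier service demands, the Utilisation Law and a simple open-queueing approximation already predict where the bottleneck is and how response time grows with load. The demand numbers below are invented.

```java
// Minimal analytic sketch: Utilisation Law (U = X * D) plus an M/M/1-style
// residence time approximation (R = D / (1 - U)) per tier, summed end to end.
// Service demands (seconds per request) are invented.
public class TierModel {

    static double residenceTime(double demand, double throughput) {
        double utilisation = throughput * demand;
        if (utilisation >= 1.0) return Double.POSITIVE_INFINITY;   // saturated tier
        return demand / (1.0 - utilisation);
    }

    public static void main(String[] args) {
        String[] tiers   = {"web", "app", "db"};
        double[] demands = {0.004, 0.012, 0.020};   // hypothetical seconds/request per tier

        for (double tps = 10; tps <= 60; tps += 10) {   // offered load, requests per second
            double totalResponse = 0;
            StringBuilder line = new StringBuilder(String.format("%3.0f req/s:", tps));
            for (int i = 0; i < tiers.length; i++) {
                totalResponse += residenceTime(demands[i], tps);
                line.append(String.format("  %s U=%3.0f%%", tiers[i], 100 * tps * demands[i]));
            }
            line.append(String.format("  predicted R=%.0f ms", 1000 * totalResponse));
            System.out.println(line);
        }
    }
}
```

The tier whose utilisation hits 100% first is the bottleneck, and the potential speed-up from reducing its demand falls out of the same formula.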

I presented published papers and attended workshops at: Middleware 2001, IFIP/ACM International Conference on Distributed Systems Platforms Heidelberg, Germany, 2001; Middleware 2005, ACM/IFIP/USENIX 6th International Middleware Conference, Grenoble, France, 2005; and IPDPS 2003, International Parallel and Distributed Processing Symposium, 2003, Nice, France.


Grid and Large Scale Distributed Systems

My next major performance-related area of experience was Grid computing. This was all about harnessing the available (possibly distributed) resources to solve large-scale, complex and potentially data-intensive parallel processing problems as fast and as economically as possible. Many ideas which I encountered (and explored) in their earlier research forms were being developed around this time, including Hadoop (Map/Reduce approaches to mapping data sets to available processing resources), Cloud (web services IaaS), etc. The architectural approaches were all about loose coupling and the use of horizontal/commodity resources via web services (including deploying, securing, and consuming end-user applications as web services on distributed resources).


The Middleware Technology Evaluation project at CSIRO was phased out in 2003 and I was moved to another project in the Grid space, architecting scientific grid computations (on a Grid cluster computer, for the dynamic execution of enormous astronomy image dataset processing workflows using web services).

Note: it's interesting to see that the latest (2015) CSIRO Astronomy and Space Science Australia Telescope Online Archive Upgrade Specification (Version 5) traces its origins back to our CSIRO work in 2003.

And that "Big Data" in 2003 was only 2TB! Actually the real problem wasn't the data size rather that the data sources were heterogeneous, didn't have meta-data, and the pipeline was complex. See this presentation.

And that many of the challenges identified for the "production" version (in 2004) would have been addressed by my evaluation and R&D of OGSA in the UK over this period (but the CSIRO project had been moved out of CSIRO back to the ATNF by the time I got back in 2005).







I also developed a funding proposal for middleware for managing the SLAs of web services based on a combination of monitoring, autonomic computing, and elastic resourcing (e.g. deploying services dynamically to servers with spare capacity to meet demand and SLA requirements - a bit "cloud like" perhaps).

Based on my previous work with J2EE middleware R&D and evaluations, and the Grid architecture work, I was invited to UCL for a year ("of benefit to CSIRO" so I could take leave and come back again) to work for Professor Wolfgang Emmerich on an EPSRC funded UK eScience project to manage the evaluation of Grid middleware based on the OGSA technology stack across 4 locations.


This was an interesting project, managing distributed resources and providing the main technical input. I also interacted with stakeholders and other interested parties in the UK eScience area, and presented regular technical updates (e.g. at Oxford, town hall meetings, etc.). I managed the installation, configuration, securing and testing of the infrastructure, along with application/benchmark development, deployment, execution and analysis. We deployed a fully functional OGSA middleware infrastructure across the 4 locations (2 in London, Newcastle and Edinburgh), which included local and centralised service registries for infrastructure and end-user service discovery and consumption.

We experimented with mechanisms to supplement the OGSA middleware in order to deploy, secure, distribute, look up, and consume end-user web services across the 4 sites, including the ability to load balance the services across the distributed resources. Some of our observations about missing features and architectural limitations included aspects that have since become common in cloud, such as: metering and billing of resource usage, the ability to deploy end-user web services across the resources and load balance them, virtualisation to assist with security, isolation and end-user management of applications, and different classes and pricing of resources for batch vs. interactive jobs, etc.



Real-time sensor event stream processing


The next significant performance and middleware project I worked on was to manage a contracted theoretical and experimental evaluation and report of the Open Geospatial Consortium (OGC) Sensor Web Enablement standards and technologies. This was a complex set of interacting XML based web and GIS-aware standards and open source middleware for collecting, organising, searching, managing, and disseminating spatial and sensor data in a distributed style across multiple locations and brokers. We developed some benchmark applications to inject sensor data and process it on the fly, store some of it in databases, and combine the stored and real-time results with queries for real-time processing in conjunction with historical data (trends etc).

This turned out to be fairly demanding of both the standards and the middleware, and we discovered performance and scalability issues as well as architectural/middleware issues (e.g. it turned out to be easy to create, but hard to detect or remove, "event loops" caused by brokers subscribing to each other's data sets). I also investigated changes to some of the standards to enable "demand based subscriptions" for events, and modelled alternatives to understand potential improvements in performance, scalability and resource usage (e.g. being able to subscribe and/or receive notifications only for changes to values above a certain minimum percentage, or at most once in a specified time period, etc.). I also evaluated open source and commercial complex/event stream processing middleware products in conjunction with this project (wrote benchmarks for them, tested them under increasing load, analysed behaviour, etc.), such as Coral8, Esper, etc. Note that some of the ideas from these standards and technologies have become mainstream through numerous open source data analytics stacks and have also found their way into cloud technologies (e.g. Amazon Kinesis).
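A minimal sketch of the "demand based subscription" idea (a hypothetical API, not part of the OGC SWE standards or the middleware we tested): a broker-side filter forwards a reading only if it has changed by at least a minimum percentage, or if a minimum interval has elapsed since the last notification, which is where the modelled savings in messages and resources came from.

```java
import java.util.function.Consumer;

// Sketch of a "demand based subscription" filter: notify the subscriber only
// when a reading has changed by at least minChangePct, or when at least
// minIntervalMs has passed since the last notification. (Hypothetical API,
// not part of the OGC SWE standards or the middleware under test.)
public class DemandBasedFilter {

    private final double minChangePct;
    private final long minIntervalMs;
    private final Consumer<Double> subscriber;

    private double lastSent = Double.NaN;
    private long lastSentAt = Long.MIN_VALUE / 2;   // "long ago", avoids overflow

    public DemandBasedFilter(double minChangePct, long minIntervalMs,
                             Consumer<Double> subscriber) {
        this.minChangePct = minChangePct;
        this.minIntervalMs = minIntervalMs;
        this.subscriber = subscriber;
    }

    public void onReading(long timestampMs, double value) {
        boolean bigChange = Double.isNaN(lastSent)
                || Math.abs(value - lastSent) >= Math.abs(lastSent) * minChangePct / 100.0;
        boolean intervalElapsed = timestampMs - lastSentAt >= minIntervalMs;
        if (bigChange || intervalElapsed) {         // everything else is suppressed
            subscriber.accept(value);
            lastSent = value;
            lastSentAt = timestampMs;
        }
    }

    public static void main(String[] args) {
        DemandBasedFilter filter = new DemandBasedFilter(5.0, 60_000,
                v -> System.out.println("notify: " + v));
        filter.onReading(0, 20.0);        // first reading: sent
        filter.onReading(1_000, 20.2);    // <5% change and <60s elapsed: suppressed
        filter.onReading(2_000, 25.0);    // >=5% change: sent
        filter.onReading(70_000, 25.1);   // 60s elapsed: sent
    }
}
```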

I managed a project to build a GUI application integrating streaming (real-time) data with historical data for processing and visualisation at the CSIRO ICT Centre 10+ years ago (e.g. data from sensor networks via Jabber/XMPP protocols, real-time in-memory CEP of streams, and persisting (some of) the streams in databases/memory for subsequent lookups and processing, etc.; c.f. the Kappa architecture and Apache Zeppelin).

10 years of Performance Modelling R&D and Consultation

For the last 10 years I've been a senior researcher with NICTA, and then CTO/Chief Scientist of a tech start-up specialising in performance engineering (via a performance modelling tool which automatically builds performance models and enables predictive analytics from APM data) to address enterprise and cloud performance, scalability, capacity and resource usage/price risks.

Clients and projects for which I've been the primary technical consultant over the last 4 years include (not in order):


  • Telstra (Cloud Migration)
    • Telstra needed to migrate multiple applications hosted for customers from a physical server environment that was being retired to a newer virtualised private-cloud platform. Their customers had concerns about performance, and about whether and how to re-architect for the new platform. We built performance models automatically from Dynatrace APM data from the old and new platforms, wrote and ran benchmarks, and used load test results to calibrate the models.
  • Department of Employment (Capacity planning from Cloud testing to in-house hosting from Dynatrace APM data)
    • This problem involved capacity planning from the User Acceptance Testing environment, which was being run on AWS. The final production application was to run on in-house infrastructure, and they needed to know how many servers to provision (given the delay between ordering and provisioning). Challenges included different platforms, and different workloads and transaction mixes. We predicted the resource requirements in terms of lower and upper 90% error margins, and they decided to over-provision initially. Modelling also assisted with planning for load testing (which could not be completed due to lack of time).
  • NBN Co (end-to-end Performance modelling of Fulfilment system from SPLUNK and AppDynamics data)
    • This was a multiple-year project. We built several different performance models from different data sources, including log files, SPLUNK data, and a PoC with AppDynamics. The problem was to find and correlate sufficient performance data to produce an end-to-end/top-to-bottom performance model of their entire system, in order to understand performance problems with a major sub-system. We also obtained data and built a performance model focussing on the messaging integration layer between systems (number, type and average/max length of integration queues). We spent a lot of time building a custom APM solution on top of SPLUNK, which was only partially successful due to limitations with the data ingested into SPLUNK, correlation problems, and problems processing large amounts of structured transactional data in SPLUNK. We were able to prove that the problem was entirely within the system of concern (i.e. we excluded other possibilities), but due to lack of visibility into this system we could not identify the actual source of the problem or model remediation options.
  • Family Court (performance modelling from ManageEngine APM data)
    • This project built performance models initially from ManageEngine APM data for the document uploading and retrieval system, looking at likely changes resulting from new document types, increasing workloads, and the proposed integration with the federal court systems.
  • Department of Immigration and Border Protection (Performance modelling of Visa processing system from in-house Compass APM data, for improved DevOps)
    • This project integrated our performance modelling tool with the DIBP in-house APM tool, Compass, which was used in all phases of the s/w lifecycle for performance monitoring. We were working primarily with the testing team to increase the number of tests that could be run, but also to shift left and right in terms of the DevOps activities. We built a number of performance models for their complete visa processing system, focussing on the Data Analytics services (which used R), to model performance, capacity, deployment optimisation and SLAs. We did some detailed modelling of deployment alternatives using a bin-packing algorithm to optimise across memory, time, and SLA variables (see the sketch after this list), and suggested some alternative deployment tactics for future R models.
  •  AUSTRAC (architectural risk evaluation of data analytics platforms, including Data Lake, Kafka, Elasticsearch, Greenplum data warehouse, Oracle, blue/green deployment, Kappa architecture)
    • We did some initial performance modelling based on architectural artefacts of their current and proposed Data Analytics pipelines, and then focussed on architectural evaluations of the evolving open source stack and possible vendor proposals. Some of the issues included how frequently, and how long, it would take to regenerate state data using a Kappa architecture.
  • ATO/Deloitte (performance modelling of the ATO eCommerce platform from IBM and Dynatrace APM data)
    • This project was challenging as we had to use 2 potential data sources for model building. The IBM data wasn't fine-grained enough, and the Dynatrace APM data wasn't complete. We built 2 models and used them to show the pros and cons of the 2 data sources, what could be accurately predicted so far (and what couldn't), and what would be possible if complete, fine-grained data was obtained, and we proposed changes to their data capture approach to achieve this in the future.
  • NZPost (capacity prediction, load forecasting and failover modelling of tracking system from Dynatrace data)
    • An NZPost PoC using Dynatrace data to predict the capacity and resource requirements of their parcel tracking system over the peak Xmas period. It included iterative model building over a number of weeks, giving feedback on Dynatrace data quality and completeness as more Dynatrace agents were deployed, and load forecasting from longer term Dynatrace data combined with peak loads from the previous year. We also investigated failure and time to recover if one of the servers failed during the peak period, and remediation solutions. This was for a large, complex BizTalk architecture.
  • Pre-sales/POC modelling of NSW government fuelcheck.nsw.gov.au and stateplus.com.au websites for AccessHQ
    • PoC models for AccessHQ for 2 of their clients, focussing on building models from publicly available data on web site performance (e.g. using Dynatrace UEM data, and Chrome developer tools timeline performance data). Built demonstrators showing the impact of an increase in load due to increasing numbers of users over time, using simple Markov models of typical user interactions. Also developed a simple "manual model building" method and tool support for AccessHQ, to enable some of their consultants to provide added-value services early in the testing lifecycle by building simple performance models of client systems, highlighting the pros/cons of different testing approaches and the risks to focus on during testing. The example used was the Macquarie University student web portal (looking at the impact of caching and database indexes, etc.).
  • DHS (modelling performance and business impact of proposed online chat system)
    •  e.g. improvement in online transaction completion rate, reduction in phone calls, etc.
    • PoC model of proposed chat system performance and scalability.
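The bin-packing sketch referred to in the DIBP item above, reduced to a single dimension (memory) with a first-fit-decreasing heuristic. The real modelling optimised across memory, time and SLA variables simultaneously; the service sizes and server capacity here are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// First-fit-decreasing bin-packing sketch: pack service memory demands (GB)
// onto as few fixed-capacity servers as the heuristic finds. Numbers are
// hypothetical; the real deployment optimisation traded off memory, time and
// SLA variables together.
public class FirstFitDecreasing {

    public static List<List<Double>> pack(double[] demands, double capacity) {
        double[] sorted = demands.clone();
        Arrays.sort(sorted);                                   // ascending...
        List<List<Double>> bins = new ArrayList<>();
        List<Double> freeSpace = new ArrayList<>();
        for (int i = sorted.length - 1; i >= 0; i--) {         // ...consume descending
            double d = sorted[i];
            int target = -1;
            for (int b = 0; b < freeSpace.size(); b++) {
                if (freeSpace.get(b) >= d) { target = b; break; }  // first bin that fits
            }
            if (target < 0) {                                  // open a new server
                bins.add(new ArrayList<>());
                freeSpace.add(capacity);
                target = bins.size() - 1;
            }
            bins.get(target).add(d);
            freeSpace.set(target, freeSpace.get(target) - d);
        }
        return bins;
    }

    public static void main(String[] args) {
        double[] serviceMemGb = {6.5, 3.0, 2.5, 4.0, 1.5, 5.0, 2.0};   // hypothetical services
        List<List<Double>> servers = pack(serviceMemGb, 8.0);          // 8 GB servers
        for (int i = 0; i < servers.size(); i++) {
            System.out.println("server " + (i + 1) + ": " + servers.get(i) + " GB");
        }
    }
}
```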

R&D of automatic performance modelling from APM data (Dynatrace, Compass, SPLUNK, AppDynamics); published and presented papers at ICPE2016, LT2016, and the invitation-only WOPR25 workshop (2017). Innovation patent holder; represented us at Dynatrace partner events in Spain and Munich, and visited the Dynatrace development lab in Austria to discuss integration options. Conducted cloud benchmarking for a client (March 2017, including AWS, Azure, Vault systems), and developed a VM benchmark for a client (Telstra private cloud). Invited to the Dagstuhl seminar on "Software Performance Engineering in the DevOps World" (2016). Invited participant at the WOPR25 workshop, Wellington, NZ, February 2017 (integration of performance modelling into APM+DevOps), including discussions about internet/cloud scale performance testing and engineering (e.g. with Dynatrace, flood.io, Xero and Facebook experts).

As a result of our technology partnership with Dynatrace I've had 5+ years of extensive experience using Dynatrace to analyse, find and remediate performance "issues" across many of our clients' technology stacks (as well as using it in our lab for experimental purposes and developing integration solutions for it and our tool). Even though this isn't the focus of our performance modelling tool and method (which is future/predictive looking), one of the phases in our process is to ensure input data quality by carefully looking at the APM data, identifying any obvious problems or missing or dubious data, and working out how to sample the data (how much, which periods, etc.). I've also had some experience evaluating and trialling AppDynamics, ManageEngine, Compass (DIBP), and some open source tools.

Who else is doing something similar? The only other similar tool and approach for automatic performance modelling from APM data (that I'm aware of) is Retit, based in Germany. They use Palladio for the modelling side.

Senior Research Scientist, and Consultancies, NICTA

Working in Emeritus Professor Ross Jeffery's Software Systems Research Group, and the Canberra-based e-Government focussed project, I conducted "use-inspired" R&D with multiple government and enterprise clients to develop a tool and method to reduce risk for large-scale software "system of systems" projects, focusing on performance, scalability, capacity, and reliability. The outcome was a model-driven tool for performance modelling (Service Oriented Performance Modelling).

Clients and projects included (not in order):


  • CBA (mobile internet banking platform modelling)
    • Performance modelling of Commonwealth Bank Mobile banking application, taking into account new and legacy backend systems and other loads. 
  • VW (Germany) pre-sales modelling of interactive web-site (car configurator) focusing on html5 framerate metrics.
    • Working with a German-based partner we developed a prototype PoC model for the VW car configurator web site, using publicly available data (e.g. collected via Chrome developer tools), and focusing on html frame rate performance.
  • Queensland Department of Health (Integrated Radiology service pilot performance evaluation)
    • Using benchmark test results, the proposed roll-out plan, network diagrams, and proposed loads, we built a performance model of the Queensland Health Informatics system. This focussed on network latency and bandwidth, and also the centralised radiology imaging processing system.
  • Department of Defence
    • RPDE/DSTO/CIOG ISR Integration solution validation: a complex project involving multiple stakeholders and multiple sources of data, including performance testing of ESB topologies in our lab, and modelling of alternative ESB architectures and the number and locations of services and servers.
    • This was a large multiple-year, multi-phase project. We were paid several million dollars for this work and won the NICTA impact project award for wealth creation.
  • Department of Health and Ageing (health emergency web site performance assessment)
    • A model of the proposed health emergency web site, taking into account the likely peak load during an epidemic outbreak, the different classes of users and documents required to be served, and the use of multiple redundant systems for reliability. Included modelling of the time to recover if the load spike was even worse than expected, and the impact on professional users' SLAs.
  • Department of Climate Change, Carbon Emissions Trading System
    • Emissions Trading Scheme tender evaluation, performance analysis of auction performance requirements and submissions
    • Detailed requirements analysis of the proposed auction type to determine performance and scalability requirements
    • Contributed to writing the tender document (non-functional requirements), and conducted initial evaluation of proposals by building performance models based on supplied data.
  • Pre-sales model for financial trading platform migration from single broker to distributed sharded broker architecture
  • Department of Education
    • Performance modelling of the migration of a large, distributed, high-volume, mission-critical application from a mainframe to a mid-range solution, looking at performance, SLAs and resource requirements in particular, but also at architectural issues surrounding the migration, including deployment and load balancing, etc.
  • Department of Immigration, security vetting system
    • Performance and capacity evaluation of the Visa processing system (we also did some more work on the same system a few years later; see above).
  • Department of Defence
    • Performance modelling and problem diagnosis of online security vetting system.
    • This project used Dynatrace data to build a performance model, supplemented with other sources of data including workflows, to model the impact on end users of possible network or system issues, and suggested possible remediations (the network was not the problem, contrary to what everyone had assumed).
  • ARC, Grants Management System
    • Performance modelling of a new research management system developed using "model driven development". We built a performance model from a combination of the "model driven development" artefacts and our own in-house Java monitoring prototype on the system running in our lab, predicted the performance and scalability of the system on the target production platform at peak load, and recommended remediation of issues related to the automatic generation of inefficient J2EE persistence code.
  • IP Australia
    • Performance modelling of the new IP Australia Trademark and Patent search enterprise applications, focussing on issues around public and in-house users doing complex searches, and overseas robots putting too much load on the system. Included modelling of the search, database, image retrieval, and web components.
  • VANguard
    • Performance modelling of a whole-of-government online security service system providing authentication, authorisation and identity services to other government departments and agencies, etc. Invented a meta-model for SOA performance modelling, built the initial prototype tool and simulation engine, and trialled it on this project. Performance questions we answered included how to roll out the service to agencies with different service mixes and loads, and the impact of adding new services composed from a mix of existing and new services (performance, scalability, resources, etc.).


Published and presented consultancy and R&D results at industry and academic conferences (e.g. ICSE, ICPE, QoSA, ICSOC, WICSA, ASWEC, etc.) and at NICTA-run workshops and training courses. Member of international conference and workshop program committees (including the SPEC Research Group), and invited referee for journals and international funding bodies. Negotiated an international collaborative research project between NICTA and the INRIA Galaxy Project on Model Driven Performance Management.


Innovative performance engineering techniques that I've used or developed over the last 10 years include:

  • Generative modelling for exploring long-tail/network-scale effects
  • Automatic sensitivity analysis
  • Bin-packing algorithms over multiple variables for deployment optimization
  • Markov chains for modelling load arrival rate distributions, complex business processes, and user web interactions (see the sketch after this list)
  • Experiments with Spline interpolation and Bayesian inference for extrapolation of data. 
  • Experiments/prototypes with model-free transformation based approaches (data-to-data)
  • Simulation-free modelling using Markov models only (accelerated with GPUs)
  • Sophisticated incremental/dynamic data sampling strategies to build models from the smallest possible sample size given constraints such as: enormous data sizes (TB), data from long time periods (e.g. months), or a time limit to build models in.
  • Detailed and ongoing vendor and client-supported, experimental evaluations of APM products (e.g. Dynatrace, AppDynamics, NewRelic, CA, ManageEngine) for ability to support performance modelling and prediction (e.g. transactional metrics depth of coverage, availability and semantics, APIs, etc).
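
A minimal sketch of the Markov-chain technique from the list above (hypothetical pages and transition probabilities): iterating a user-navigation transition matrix gives the steady-state page mix, which, multiplied by the overall arrival rate, becomes the per-transaction workload fed into a performance model.

```java
import java.util.Arrays;

// Markov chain sketch: user navigation between pages as a transition matrix,
// iterated to a steady-state distribution that gives the transaction mix used
// to drive a load/performance model. Pages and probabilities are hypothetical.
public class UserJourneyMarkov {

    static final String[] PAGES = {"home", "search", "details", "checkout"};

    // P[i][j] = probability of going from page i to page j (each row sums to 1)
    static final double[][] P = {
            {0.10, 0.60, 0.20, 0.10},
            {0.10, 0.30, 0.55, 0.05},
            {0.20, 0.30, 0.20, 0.30},
            {0.70, 0.10, 0.10, 0.10},
    };

    public static void main(String[] args) {
        double[] dist = {1.0, 0.0, 0.0, 0.0};           // everyone starts at "home"
        for (int step = 0; step < 200; step++) {        // power iteration to steady state
            double[] next = new double[dist.length];
            for (int i = 0; i < dist.length; i++)
                for (int j = 0; j < dist.length; j++)
                    next[j] += dist[i] * P[i][j];
            dist = next;
        }
        // Steady-state mix: multiply by the overall arrival rate to get per-page load.
        for (int i = 0; i < PAGES.length; i++)
            System.out.printf("%-8s %.1f%%%n", PAGES[i], 100 * dist[i]);
        System.out.println("check: distribution sums to " + Arrays.stream(dist).sum());
    }
}
```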


Cloud elasticity & cost

R&D on the impact of cloud elasticity on real applications. Using performance data from three real client systems, I conducted benchmarking on multiple cloud provider platforms and modelled the impact of different instance sizes and type mixes, instance spin-up times and workload patterns/spikes on response time SLAs and cost. Results published at CMG and ICPE.
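A toy version of the trade-off being modelled (all capacities, prices and spin-up times are invented, not measured cloud figures): for a given load spike, larger instances mean fewer of them but a different exposure window while the extra capacity spins up, and a different cost for the hour.

```java
// Toy elasticity trade-off sketch (all numbers invented): when load steps up,
// how long is the system under-provisioned while extra instances spin up, and
// roughly what does each instance-size option cost for the hour?
public class ElasticityTradeoff {

    static void evaluate(String name, double perInstanceRps, double pricePerHour,
                         double spinUpMinutes) {
        double baseLoad = 400, peakLoad = 1000;              // requests per second
        int baseInstances = (int) Math.ceil(baseLoad / perInstanceRps);
        int peakInstances = (int) Math.ceil(peakLoad / perInstanceRps);

        // Until the extra instances are up, capacity stays at the base level.
        double shortfallRps = Math.max(0, peakLoad - baseInstances * perInstanceRps);
        double exposedMinutes = shortfallRps > 0 ? spinUpMinutes : 0;

        // Simplified cost: base instances run the whole hour, the extra
        // instances needed for the spike run for half of it.
        double hourCost = baseInstances * pricePerHour
                + (peakInstances - baseInstances) * pricePerHour * 0.5;

        System.out.printf("%-7s base=%d peak=%d exposed=%.0f min (%.0f req/s short) cost=$%.2f/h%n",
                name, baseInstances, peakInstances, exposedMinutes, shortfallRps, hourCost);
    }

    public static void main(String[] args) {
        // name, capacity per instance (req/s), price ($/hour), spin-up time (minutes)
        evaluate("small",  100, 0.10, 3);
        evaluate("large",  400, 0.40, 6);
        evaluate("xlarge", 800, 0.85, 8);
    }
}
```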


Big Data/Data Analytics Performance and Scalability: Opportunities for Performance Modelling of Data Analytics Platforms and Applications




Micro Services and Performance


Microservices architectures have become the next big thing. The theory is that more, smaller services are better (for DevOps anyway). During ICPE2016 in Delft last year I had the opportunity to hear several talks on SOA, microservices migrations, modelling and performance, and to participate in some discussions around this space. As a result I did some preliminary work on modelling the performance, scalability and resource implications of moving incrementally from a typical SOA (using existing customer APM data and performance models we had built) to a microservices architecture. Some of the variables I looked at included: how many complex services there are and how many (redundant) services they call (as this tends to result in the Zipf distribution effect observed in many SOAs, where only a few services have most of the service demand); how the services are sliced and diced to turn them into microservices; how much overhead there is per microservice that was previously "absorbed" by having coarser-grained services; and how many of the original services are actually retired vs. kept in use, etc. It turns out that, initially at least, the complexity, service demand and response times may go up, so watch out. I haven't completed or published this work yet, but if you think it may still be interesting let me know :-) See this blog under Zipf's law for more info.
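A rough back-of-envelope sketch of that preliminary model (all parameters hypothetical): give the original services a Zipf-like demand distribution, split each into several microservices, add a fixed per-call overhead that the coarser-grained services used to absorb, and compare total service demand per request before and after.

```java
// Rough sketch of the SOA-to-microservices model: Zipf-like demand over the
// original services, each split into k microservices with a fixed per-call
// overhead. All parameters are hypothetical illustrations.
public class MicroserviceSplitModel {

    public static void main(String[] args) {
        int services = 20;
        double totalDemandMs = 1000;        // total service demand per request (ms)
        int splitFactor = 5;                // each service becomes 5 microservices
        double overheadMsPerCall = 2.0;     // extra marshalling/network per microservice call

        // Zipf-like weights: demand of service i proportional to 1/i.
        double norm = 0;
        for (int i = 1; i <= services; i++) norm += 1.0 / i;

        double before = 0, after = 0;
        for (int i = 1; i <= services; i++) {
            double demand = totalDemandMs * (1.0 / i) / norm;
            before += demand;
            // Same work spread over splitFactor microservices, plus overhead per call.
            after += demand + splitFactor * overheadMsPerCall;
        }
        System.out.printf("demand before split: %.0f ms per request%n", before);
        System.out.printf("demand after split:  %.0f ms per request (+%.0f%%)%n",
                after, 100 * (after - before) / before);
    }
}
```

Even this crude version shows total service demand (and hence response time, once utilisation rises) going up before any of the expected benefits kick in.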
