SkyHive is an end-to-end reskilling platform that automates skills assessment, identifies future talent needs, and fills skill gaps through targeted learning recommendations and job opportunities. We work with leaders in the space, including Accenture and Workday, and have been recognized as a Cool Vendor in human capital management by Gartner.
We’ve already built a Labor Market Intelligence database that stores:
- Profiles of 800 million (anonymized) workers and 40 million companies
- 1.6 billion job descriptions from 150 countries
- 3 trillion unique skill combinations required for current and future jobs
Our database ingests 16 TB of data every day, from job postings scraped by our web crawlers to paid streaming data feeds. We have also done a lot of complex analytics and machine learning to glean insights into global job trends today and tomorrow.
Thanks to our ahead-of-the-curve technology, good word-of-mouth and partners like Accenture, we’re growing fast, adding 2-4 corporate customers every day.
Driven by Data and Analytics
Like Uber, Airbnb, Netflix, and others, we’re disrupting an industry, in this case the global HR/HCM industry, with data-driven services that include:
- SkyHive Talent Passport – a web-based service educating workers on the job skills they need to build their careers, and resources on how to get them.
- SkyHive Enterprise – a paid dashboard (below) for executives and HR to analyze and drill into data on a) their employees’ aggregated job skills, b) what skills companies need to succeed in the future, and c) the skills gaps.

- Platform-as-a-Service via APIs – a paid service allowing businesses to tap into deeper insights, such as comparisons with competitors, and recruiting recommendations to fill skills gaps.

Challenges with MongoDB for Analytical Queries
16 TB of raw text data from our web crawlers and other data feeds is dumped daily into our S3 data lake. That data was processed and then loaded into our analytics and serving database, MongoDB.
MongoDB query performance was too slow to support complex analytics involving data across jobs, resumes, courses and different geographies, especially when query patterns weren’t defined ahead of time. This made multidimensional queries and joins slow and costly, making it impossible to provide the interactive performance our users required.
For example, I had one large pharmaceutical customer ask if it would be possible to find all of the data scientists in the world with a clinical trials background and 3+ years of pharmaceutical experience. It would have been an incredibly expensive operation, but of course the customer was looking for immediate results.
When the customer asked if we could expand the search to non-English speaking countries, I had to explain it was beyond the product’s current capabilities, as we had problems normalizing data across different languages with MongoDB.
There were also limitations on payload sizes in MongoDB, as well as other strange hardcoded quirks. For instance, we couldn’t query Great Britain as a country.
All in all, we had significant challenges with query latency and getting our data into MongoDB, and we knew we needed to move to something else.
Real-Time Data Stack with Databricks and Rockset
We needed a storage layer capable of large-scale ML processing for terabytes of new data per day. We compared Snowflake and Databricks, choosing the latter because of Databricks’ compatibility with more tooling options and support for open data formats. Using Databricks, we have deployed (below) a lakehouse architecture, storing and processing our data through three progressive Delta Lake stages. Crawled and other raw data lands in our Bronze layer and subsequently goes through Spark ETL and ML pipelines that refine and enrich the data for the Silver layer. We then create coarse-grained aggregations across multiple dimensions, such as geographical location, job function, and time, that are stored in the Gold layer.
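The three stages can be illustrated with a toy sketch in plain Python. This is not our production code: the real pipelines run on Spark and Delta Lake, and the record fields (`title`, `country`, `month`), the cleaning rules in `refine`, and all function names here are assumptions made up for the example.

```python
from collections import Counter

# Bronze: raw crawled records, kept as they arrived (hypothetical fields).
bronze = [
    {"title": " Data Scientist ", "country": "us", "month": "2022-01"},
    {"title": "data scientist", "country": "US", "month": "2022-01"},
    {"title": "Nurse", "country": "ca", "month": "2022-01"},
    {"title": "", "country": "US", "month": "2022-02"},  # unusable row
]


def refine(record):
    """Silver step: normalize one raw record, or return None to drop it."""
    title = record["title"].strip().lower()
    if not title:
        return None
    return {
        "job_function": title,
        "country": record["country"].upper(),
        "month": record["month"],
    }


def to_silver(raw_records):
    """Clean every Bronze record and discard the ones that cannot be used."""
    return [r for r in map(refine, raw_records) if r is not None]


def to_gold(silver_records):
    """Gold step: count postings per (country, job_function, month)."""
    return Counter(
        (r["country"], r["job_function"], r["month"]) for r in silver_records
    )


silver = to_silver(bronze)
gold = to_gold(silver)
for key, count in sorted(gold.items()):
    print(key, count)
```

Each stage only reads the output of the previous one, so a refinement rule can be changed and the Silver and Gold data rebuilt without re-crawling anything.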
We have SLAs on query latency in the low hundreds of milliseconds, even as users make complex, multi-faceted queries. Spark was not built for that; such queries are treated as data jobs that would take tens of seconds. We needed a real-time analytics engine, one that creates an uber-index of our data in order to deliver multidimensional analytics in a heartbeat.
We chose Rockset to be our new user-facing serving database. Rockset continuously synchronizes with the Gold layer data and instantly builds an index of that data. Taking the coarse-grained aggregations in the Gold layer, Rockset queries and joins across multiple dimensions and performs the finer-grained aggregations required to serve user queries. That enables us to serve: 1) pre-defined Query Lambdas sending regular data feeds to customers; 2) ad hoc free-text searches such as “What are all of the remote jobs in the United States?”
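To make the idea of finer-grained aggregation over coarse Gold data concrete, here is a minimal sketch in plain Python. It does not use Rockset or its API; it only mimics the join, filter, and re-aggregate shape of such a query. The row layout, the `country_region` lookup table, and the function name are all hypothetical.

```python
from collections import defaultdict

# Hypothetical coarse-grained Gold rows:
# (country, job_function, month, is_remote, posting_count)
gold_rows = [
    ("US", "data scientist", "2022-01", True, 120),
    ("US", "data scientist", "2022-02", True, 80),
    ("US", "nurse", "2022-01", False, 300),
    ("CA", "data scientist", "2022-01", True, 40),
    ("GB", "data scientist", "2022-01", True, 55),
]

# Hypothetical dimension table used for the join.
country_region = {"US": "North America", "CA": "North America", "GB": "Europe"}


def remote_postings_by_function(rows, regions, region):
    """Join rows to their region, keep remote jobs, and sum per job function."""
    totals = defaultdict(int)
    for country, job_function, _month, is_remote, count in rows:
        if is_remote and regions.get(country) == region:
            totals[job_function] += count
    return dict(totals)


result = remote_postings_by_function(gold_rows, country_region, "North America")
print(result)
```

The Gold rows are already aggregated by country, function, and month, so the query only has to combine a handful of pre-computed counts rather than scan individual job postings.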
Sub-Second Analytics and Faster Iterations
After several months of development and testing, we switched our Labor Market Intelligence database from MongoDB to Rockset and Databricks. With Databricks, we have improved our ability to handle huge datasets as well as efficiently run our ML models and other non-time-sensitive processing. Meanwhile, Rockset enables us to support complex queries on large-scale data and return answers to users in milliseconds with little compute cost.
For instance, our customers can search for the top 20 skills in any country in the world and get results back in near real time. We can also support a much higher volume of customer queries, as Rockset alone can handle millions of queries a day, regardless of query complexity, the number of concurrent queries, or sudden scale-ups elsewhere in the system (such as from bursty incoming data feeds).
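A "top 20 skills in a country" lookup has a simple shape, sketched below in plain Python. The `(country, skill, demand_count)` rows and the `top_skills` helper are assumptions for illustration only; in practice a query like this would run inside the serving database rather than in application code.

```python
import heapq
from collections import Counter

# Hypothetical (country, skill, demand_count) rows.
skill_rows = [
    ("US", "python", 900),
    ("US", "sql", 750),
    ("US", "spark", 400),
    ("GB", "python", 300),
    ("GB", "excel", 350),
    ("US", "python", 100),
]


def top_skills(rows, country, n=20):
    """Sum demand per skill for one country and return the n largest."""
    totals = Counter()
    for row_country, skill, count in rows:
        if row_country == country:
            totals[skill] += count
    # nlargest avoids sorting the full skill list when n is small.
    return heapq.nlargest(n, totals.items(), key=lambda item: item[1])


top_us = top_skills(skill_rows, "US", n=2)
print(top_us)
```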
We are now easily hitting all of our customer SLAs, including our sub-300 millisecond query time guarantees. We can provide the real-time answers that our customers need and our competitors can’t match. And with Rockset’s SQL-to-REST API support, presenting query results to applications is easy.
Rockset also speeds up development time, boosting both our internal operations and external sales. Previously, it took us three to nine months to build a proof of concept for customers. With Rockset features such as SQL-to-REST using Query Lambdas, we can now deploy dashboards customized to the prospective customer hours after a sales demo.
We call this “product day zero.” We don’t have to sell to our prospects anymore; we just ask them to go and try us out. They’ll discover they can interact with our data with no noticeable delay. Rockset’s low-ops, serverless cloud delivery also makes it easy for our developers to deploy new services to new users and customer prospects.
We’re planning to further streamline our data architecture (above) while expanding our use of Rockset into a couple of other areas:
- geospatial queries, so that users can search by zooming in and out of a map;
- serving data to our ML models.
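As a rough illustration of the first item, a map viewport can be treated as a bounding box that filters results as the user zooms. The sketch below is plain Python with made-up job locations; the `BoundingBox` class and its methods are hypothetical, do not reflect any Rockset geospatial function, and ignore boxes that cross the antimeridian.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """A map viewport in degrees (hypothetical helper for this example)."""

    south: float
    west: float
    north: float
    east: float

    def contains(self, lat, lon):
        """True if the point lies inside the viewport."""
        return self.south <= lat <= self.north and self.west <= lon <= self.east

    def zoom(self, factor):
        """Return a box scaled around its center; factor < 1 zooms in."""
        center_lat = (self.south + self.north) / 2
        center_lon = (self.west + self.east) / 2
        half_lat = (self.north - self.south) / 2 * factor
        half_lon = (self.east - self.west) / 2 * factor
        return BoundingBox(
            center_lat - half_lat,
            center_lon - half_lon,
            center_lat + half_lat,
            center_lon + half_lon,
        )


# Made-up job locations: (city, latitude, longitude).
jobs = [
    ("Vancouver", 49.28, -123.12),
    ("Seattle", 47.61, -122.33),
    ("Toronto", 43.65, -79.38),
]


def jobs_in_view(box):
    """Names of the jobs whose location falls inside the viewport."""
    return [city for city, lat, lon in jobs if box.contains(lat, lon)]


viewport = BoundingBox(south=45.0, west=-126.0, north=51.0, east=-118.0)
print(jobs_in_view(viewport))
print(jobs_in_view(viewport.zoom(0.25)))
```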
These projects would likely take place over the next year. With Databricks and Rockset, we have already transformed and built out a beautiful stack. But there is still much more room to grow.
