How Sustainable is a Solar Powered Website?


In September 2018, Low-tech Magazine launched a new website that aimed to radically reduce the energy use and carbon emissions associated with accessing its content. Internet energy use is growing quickly on account of both increasing bit rates (online content gets “heavier”) and increased time spent online (especially since the arrival of mobile computing and wireless internet).

The solar powered website bucks against these trends. To drop energy use far below that of the average website, we opted for a back-to-basics web design, using a static website instead of a database driven content management system. To reduce the energy use associated with the production of the solar panel and the battery, we chose a minimal set-up and accepted that the website goes off-line when the weather is bad.

We have been monitoring the solar powered server for 15 months now, and we have collected data on uptime, energy use, power use, system efficiency, and visitor traffic. We also calculated how much energy was required to make the solar panel, the battery, the charge controller and the server.

Uptime, Electricity Use & System Efficiency

The solar powered website goes off-line when the weather is bad – but how often does that happen? For a period of about one year (351 days, from 12 December 2018 to 28 November 2019), we achieved an uptime of 95.26%. This means that we were off-line due to bad weather for 399 hours.

If we ignore the last two months, our uptime was 98.2%, with a downtime of only 152 hours. Uptime plummeted to 80% during the last two months, when a software upgrade increased the energy use of the server. This knocked the website off-line for at least a few hours every night.

One kilowatt-hour of solar generated electricity can serve almost 50,000 unique visitors

Let’s have a look at the electricity used by our web server (the “operational” energy use). We have measurements from the server and from the solar charge controller. Comparing both values reveals the inefficiencies in the system. Over a period of roughly one year (from 3 December 2018 to 24 November 2019), the electricity use of our server was 9.53 kilowatt-hours (kWh).

We measured significant losses in the solar PV system due to voltage conversions and charge/discharge losses in the battery. The solar charge controller showed a yearly electricity use of 18.10 kWh, meaning that system efficiency was roughly 50%.

During the period under study, the solar powered website received 865,000 unique visitors. Including all energy losses in the solar set-up, electricity use per unique visitor is then 0.021 watt-hour. One kilowatt-hour of solar generated electricity can thus serve almost 50,000 unique visitors, and one watt-hour of electricity can serve roughly 50 unique visitors. This is all renewable energy and as such there are no direct associated carbon emissions.
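The per-visitor figures follow directly from the measurements; a minimal sketch of the arithmetic:

```python
# Yearly measurements from the solar powered server (Dec 2018 - Nov 2019)
server_kwh = 9.53      # electricity used by the server itself
system_kwh = 18.10     # electricity drawn at the solar charge controller
visitors = 865_000     # unique visitors over the same period

system_efficiency = server_kwh / system_kwh        # ~0.53, i.e. roughly 50%
wh_per_visitor = system_kwh * 1_000 / visitors     # ~0.021 Wh, losses included
visitors_per_kwh = visitors / system_kwh           # ~47,800 visitors per kWh
```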

Embodied Energy Use & Uptime

The story often ends here when renewable energy is presented as a solution for the growing energy use of the internet. When researchers examine the energy use of data centers, which host the content that is accessible on the internet, they never take into account the energy that is required to build and maintain the infrastructure that powers those data centers.

There is no such omission with a self-hosted website powered by an off-the-grid solar PV installation. The solar panel, the battery, and the solar charge controller are equally essential parts of the installation as the server itself. Consequently, energy use for the mining of the resources and the manufacture of these components – the “embodied energy” – must also be taken into account.


A simple representation of our system. The voltage conversion (between the 12V charge controller and the 5V server) and the battery meter (between the server and the battery) are missing. Illustration: Diego Marmolejo.

Unfortunately, most of this energy comes from fossil fuels, either in the form of diesel (mining the raw materials and transporting the components) or in the form of electricity generated mainly by fossil fuel power plants (most manufacturing processes).

The sizing of battery and solar panel is a compromise between uptime and sustainability

The embodied energy of our configuration is mainly determined by the size of the battery and the solar panel. At the same time, the size of battery and solar panel determine how often the website will be online (the “uptime”). Consequently, the sizing of battery and solar panel is a compromise between uptime and sustainability.

To find the optimal balance, we have run (and keep running) our system with different combinations of solar panels and batteries. Uptime and embodied energy are also determined by the local weather conditions, so the results we present here are only valid for our location (the balcony of the author’s home near Barcelona, Spain).



Different sizes of solar panels and batteries. Illustration: Diego Marmolejo.

Uptime and Battery size

Battery storage capacity determines how long the website can run without a supply of solar power. A minimum of energy storage is required to get through the night, while additional storage can compensate for a certain period of low (or no) solar power production during the day. Batteries deteriorate with age, so it’s best to start with more capacity than is actually needed, otherwise the battery needs to be replaced rather quickly.

> 90% Uptime

First, let’s calculate the minimum energy storage needed to keep the website online during the night, provided that the weather is good, the battery is new, and the solar panel is large enough to charge the battery completely. The average power use of our web server during the first year, including all energy losses in the solar installation, was 1.97 watts. During the shortest night of the year (8h50, June 21), we need 17.40 watt-hour of storage capacity, and during the longest night of the year (14h49, December 21), we need 29.19 Wh.


Because lead-acid batteries should not be discharged below half of their capacity, the solar powered server requires a 60 Wh lead-acid battery to get through the longest nights when solar conditions are optimal (2 x 29.19 Wh). For most of the year we ran the system with a slightly larger energy storage (up to 86.4 Wh) and a 50W solar panel, and achieved the above-mentioned uptime of 95-98%. [1]
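The storage arithmetic above can be sketched as follows; the 50% depth-of-discharge limit for lead-acid batteries doubles the required nominal capacity:

```python
AVG_POWER_W = 1.97  # average draw during the first year, losses included

def lead_acid_wh(night_hours, depth_of_discharge=0.5):
    """Nominal lead-acid capacity needed to bridge one night,
    given that the battery should not be discharged below ~50%."""
    return night_hours * AVG_POWER_W / depth_of_discharge

shortest_night = lead_acid_wh(8 + 50 / 60)   # 8h50, June 21 -> ~34.8 Wh
longest_night = lead_acid_wh(14 + 49 / 60)   # 14h49, Dec 21 -> ~58.4 Wh (rounded up to 60 Wh)
```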

100% Uptime

A larger battery would keep the website running even during longer periods of bad weather, again provided that the solar panel is large enough to charge the battery completely. To compensate for each day of very bad weather (no significant power production), we need 47.28 watt-hour (24h x 1.97 watts) of storage capacity.

From 1 December 2019 to 12 January 2020, we combined the 50 W solar panel with a 168 watt-hour battery, which has a practical storage capacity of 84 watt-hour. This is enough storage to keep the website running for two nights and a day. Even though we tested this configuration during the darkest period of the year, we had relatively nice weather and achieved an uptime of 100%.

However, to assure an uptime of 100% over a period of years would require more energy storage. To keep the website online during four days of low or no power production, we would need a 440 watt-hour lead-acid battery – the size of a car battery. We include this configuration to represent the conventional approach to off-grid solar power.

< 90% Uptime

We also made calculations for batteries that aren’t large enough to get the website through the shortest night of the year: 48 Wh, 24 Wh, and 15.6 Wh (with practical storage capacities of 24 Wh, 12 Wh, and 7.8 Wh, respectively). The last is the smallest lead-acid battery commercially available.

A website that goes off-line in the evening could be an interesting option for a local online publication with low anticipated traffic after midnight.

If the weather is good, the 48 Wh lead-acid battery will keep the server running during the night from March to September. The 24 Wh lead-acid battery can keep the website online for a maximum of 6 hours, meaning that the server will go off-line each night of the year, although at different hours depending on the season.

Finally, the 15.6 Wh battery keeps the website online for only four hours when there’s no solar power. Even if the weather is good, the server will stop working around 1 am in summer and around 9 pm in winter. The maximum uptime for the smallest battery would be around 50%, and in practice it will be lower due to clouds and rain.

A website that goes off-line in the evening could be an interesting option for a local online publication with low anticipated traffic after midnight. However, since Low-tech Magazine’s readership is almost equally divided between Europe and the USA, this is not an attractive option. If the website goes down every night, our American readers could only access it during the morning.

Uptime and Solar Panel Size

The uptime of the solar powered website is not only determined by the battery, but also by the solar panel, especially in relation to bad weather. The larger the solar panel, the quicker it will charge the battery and fewer hours of sun will be needed to get the website through the night. For example, with the 50 W solar panel, one to two hours of full sun are sufficient to completely charge any of the batteries (except for the car battery).


Different sizes of solar panels. Illustration: Diego Marmolejo.

Replace the 50 W solar panel with a 10 W solar panel, however, and the system needs at least 5.5 hours to charge the 86.4 Wh battery in optimal conditions (2 W to operate the server, 8 W to charge the battery). If the 10W solar panel is combined with a larger, 168 Wh lead-acid battery, it needs 10.5 hours of full sun to charge the battery completely, which is only possible from February to November.
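These charging times can be reproduced with a quick back-of-the-envelope calculation, assuming the battery starts half-discharged and ignoring charging inefficiencies (which is why the real figures are slightly higher):

```python
def hours_of_full_sun(panel_w, battery_wh, server_w=2, depth_of_discharge=0.5):
    """Full-sun hours needed to refill a half-discharged lead-acid
    battery while the panel also keeps the server running."""
    charge_w = panel_w - server_w   # power left over for charging
    return battery_wh * depth_of_discharge / charge_w

hours_of_full_sun(10, 86.4)   # ~5.4 h under these idealised assumptions
hours_of_full_sun(10, 168)    # ~10.5 h
```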

A larger solar panel increases the chances that the website remains online even when weather conditions are not optimal.

A larger solar panel is equally advantageous during cloudy weather. Clouds can lower solar energy production to anywhere between 0 and 90% of maximum capacity, depending on the thickness of cloud cover. If a 50 watt solar panel produces just 10% of its maximum capacity (5W), that’s still enough to run the server (2W) and charge the battery (3W).

However, if a 10 W solar panel only produces 10% of its capacity, that’s just enough to power the server, and the battery won’t be charged. We ran the website on a 10 W panel from 12 to 21 January 2020, and it quickly went down when the weather was not optimal. We are now powering the website with a 30W solar panel (and a 168 Wh battery).


A 5 W solar panel – the smallest 12V solar panel commercially available – is the absolute minimum required to run a solar powered website. However, only under optimal conditions will it be able to power the server (2W) and charge the battery (3W), and it could only keep the website running through the night if the day is long enough. Because solar panels rarely generate their maximum power capacity, this would result in a website that is online only while the sun shines.

Even though the combination of a small solar panel and large battery can have the same embodied energy as the combination of a large solar panel and a small battery, the system each creates will have very different characteristics. In general, it’s best to opt for a larger solar panel and a smaller battery, because this combination increases the life expectancy of the battery – lead-acid batteries need to be fully charged from time to time or they lose storage capacity.

Embodied Energy for Different Sizes of Batteries and Solar Panels

It takes 1.03 megajoule (MJ) to produce 1 watt-hour of lead-acid battery capacity [2], and 3,514 MJ of energy to produce one m2 of solar panel. [3] In the table below, we present the embodied energy for different sizes of batteries and solar panels and then calculate the embodied energy per year, based on a life expectancy of 5 years for batteries and 25 years for solar panels. The values are converted to kilowatt-hours per year and refer to primary energy, not electricity.
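The per-year figures follow from dividing by the life expectancy and converting megajoules to kilowatt-hours (1 kWh = 3.6 MJ). A sketch, using the 86.4 Wh battery as an example:

```python
MJ_PER_KWH = 3.6

def battery_embodied_kwh_per_year(capacity_wh, mj_per_wh=1.03, life_years=5):
    """Embodied primary energy of a lead-acid battery, per year of service."""
    return capacity_wh * mj_per_wh / MJ_PER_KWH / life_years

def panel_embodied_kwh_per_year(area_m2, mj_per_m2=3_514, life_years=25):
    """Embodied primary energy of a solar panel, per year of service."""
    return area_m2 * mj_per_m2 / MJ_PER_KWH / life_years

battery_embodied_kwh_per_year(86.4)   # ~4.9 kWh primary energy per year
```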

A solar powered website also needs a charge controller and of course a web server. The embodied energy for these components remains the same no matter the size of solar panel or battery. The embodied energy per year is based on a life expectancy of 10 years. [4][5]

Tables: embodied energy for different sizes of batteries and solar panels; embodied energy for the other components (charge controller and server).

We now have all data to calculate the total embodied energy for each combination of solar panels and batteries. The results are presented in the table below. The embodied energy varies by a factor of five depending on the configuration: from 10.92 kWh primary energy per year for the combination of the smallest solar panel (5W) with the smallest battery (15.6 Wh) to 50.46 kWh primary energy per year for the combination of the largest solar panel (50 W) with the largest battery (440Wh).

Tables: embodied energy per year for different solar set-ups; expected uptimes by battery type.

If we divide these results by the number of unique visitors per year (865,000), we obtain the embodied energy use per unique visitor to our website. For our original configuration with 95-98% uptime (50W solar panel, 86.4Wh battery), primary energy use per unique visitor is 0.03 Wh. This result would be pretty similar for the other configurations with a lower uptime, because although the embodied energy is lower, so is the number of unique visitors.

Carbon Emissions: How Sustainable is the Solar Powered Website?

Now that we have calculated the embodied energy of different configurations, we can calculate the carbon emissions. We can’t compare the environmental footprint of the solar powered website with that of the old website, because it is hosted elsewhere and we can’t measure its energy use. What we can compare is the solar powered website with a similar self-hosted configuration that is run on grid power. This allows us to assess the (un)sustainability of running the website on solar power.

Life cycle analyses of solar panels are not very useful for working out the CO2-emissions of our components because they work on the assumption that all energy produced by the panels is used. This is not necessarily true in our case: the larger solar panels waste a lot of solar power in optimal weather conditions.

Hosting the solar powered Low-tech Magazine for a year has produced as many emissions as an average car driving a distance of 50 km.

We therefore take another approach: we convert the embodied energy of our components to litres of oil (1 litre of oil is 10 kWh of primary energy) and calculate the result based on the CO2-emissions of oil (1 litre of oil produces 3 kg of greenhouse gases, including mining and refining it). This takes into account that most solar panels and batteries are now produced in China – where the power grid is three times as carbon-intensive and 50% less energy efficient than in Europe. [6]

This means that fossil fuel use associated with running the solar powered Low-tech Magazine during the first year (50W panel, 86.4 Wh battery) corresponds to 3 litres of oil and 9 kg of carbon emissions – as much as an average European car driving a distance of 50 km. Below are the results for the other configurations:
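The oil conversion is simple enough to verify; the ~30 kWh of yearly embodied primary energy for the first-year set-up is an approximation backed out from the 3-litre figure above:

```python
def embodied_co2(embodied_kwh, kwh_per_litre=10, co2_kg_per_litre=3):
    """Convert embodied primary energy to litres of oil and kg of CO2."""
    litres = embodied_kwh / kwh_per_litre
    return litres, litres * co2_kg_per_litre

litres, co2_kg = embodied_co2(30)   # first-year set-up: 3 litres, 9 kg CO2
```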


Comparison with Carbon Intensity of Spanish Power Grid

Now let’s calculate the hypothetical CO2-emissions from running our self-hosted web server on grid power instead of solar power. CO2-emissions in this case depend on the Spanish power grid, which happens to be one of the least carbon intensive in Europe due to its high share of renewable and nuclear energy (respectively 36.8% and 22% in 2019).

Last year, the carbon intensity of the Spanish power grid decreased to 162 g of CO2 per kWh of electricity. For comparison, the average carbon intensity in Europe is around 300 g per kWh, while the carbon intensities of the US and Chinese power grids are above 400 g and 900 g of CO2 per kWh, respectively.

If we just look at the operational energy use of our server, which was 9.53 kWh of electricity during the first year, running it on the Spanish power grid would have produced 1.54 kg of CO2-emissions, compared to 3 – 9 kg in our tested configurations. This seems to indicate that our solar powered server is a bad idea, because even the smallest solar panel with the smallest battery generates more carbon emissions than grid power.
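The grid-power comparison works out as follows:

```python
server_kwh = 9.53        # first-year operational electricity use of the server
spain_g_per_kwh = 162    # carbon intensity of the Spanish power grid, 2019

grid_co2_kg = server_kwh * spain_g_per_kwh / 1_000   # ~1.54 kg of CO2
```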

When the carbon intensity of the power grid is measured, the embodied energy of the renewable power infrastructure is taken to be zero.

However, we’re comparing apples to oranges. We have calculated our emissions based on the embodied energy of our installation. When the carbon intensity of the Spanish power grid is measured, the embodied energy of the renewable power infrastructure is taken to be zero. If we calculated our carbon intensity in the same way, of course it would be zero, too.

Ignoring the embodied carbon emissions of the power infrastructure is reasonable when the grid is powered by fossil fuel power plants, because the carbon emissions to build that infrastructure are very small compared to the carbon emissions of the fuel that is burned. However, the reverse is true of renewable power sources, where operational carbon emissions are almost zero but carbon is emitted during the production of the power plants themselves.

To make a fair comparison with our solar powered server, the calculation of the carbon intensity of the Spanish power grid should take into account the emissions from building and maintaining the power plants, the transmission lines, and – should fossil fuel power plants eventually disappear – the energy storage. Of course, ultimately, the embodied energy of all these components would depend on the chosen uptime.

Possible Improvements

There are many ways in which the sustainability of our solar powered website could be improved while maintaining our present uptime. Producing solar panels and batteries using electricity from the Spanish grid would have the largest impact in terms of carbon emissions, because the carbon footprint of our configuration would be roughly 5 times lower than it is now.


What we can do ourselves is lower the operational energy use of the server and improve the system efficiency of the solar PV installation. Both would allow us to run the server with a smaller battery and solar panel, thereby reducing embodied energy. We could also switch to another type of energy storage or even another type of energy source.


We already made some changes that have resulted in a lower operational energy use of the server. For example, we discovered that more than half of total data traffic on our server (6.63 of 11.16 TB) was caused by a single broken RSS implementation that pulled our feed every couple of minutes.

A difference in power use of 0.19 watts adds up to 4.56 watt-hour over the course of 24 hours, which means that the website can stay online for more than 2.5 hours longer.

Fixing this as well as some other changes lowered the power use of the server (excluding energy losses) from 1.14 watts to about 0.95 watts. The gain may seem small, but a difference in power use of 0.19 watts adds up to 4.56 watt-hour over the course of 24 hours, which means that the website can stay online for more than 2.5 hours longer.
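The savings arithmetic, with the extra runtime hedged because it depends on the total night-time draw from the battery (voltage conversion and battery losses included):

```python
old_w, new_w = 1.14, 0.95
saved_wh_per_day = (old_w - new_w) * 24   # ~4.56 Wh saved per day

# Extra night-time runtime depends on the total draw from the battery,
# losses included; at roughly 1.8 W that is about 2.5 extra hours.
extra_hours = saved_wh_per_day / 1.8
```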

System Efficiency

System efficiency was only 50% during the first year. Energy losses occur during charging and discharging of the battery (22%), as well as in the voltage conversion from 12V (solar PV system) to 5V (USB connection), where the losses add up to 28%. The initial voltage converter we built was pretty suboptimal (our solar charge controller doesn’t have a built-in USB connection), so we could build a better one, or switch to a 5V solar PV set-up.
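The two loss factors compound multiplicatively, which is consistent with the measured yearly figure:

```python
battery_eff = 1 - 0.22      # charge/discharge losses in the lead-acid battery
conversion_eff = 1 - 0.28   # 12V -> 5V voltage conversion losses

system_eff = battery_eff * conversion_eff   # ~0.56 in the best case
# The measured yearly ratio, 9.53 kWh / 18.10 kWh ~ 0.53, is slightly
# lower, suggesting additional small losses elsewhere in the system.
measured_eff = 9.53 / 18.10
```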

Energy Storage

To increase the efficiency of the energy storage, we could replace the lead-acid batteries with more expensive lithium-ion batteries, which have lower charge/discharge losses (<10%) and lower embodied energy. More likely, we will eventually switch to a more poetic small-scale compressed air energy storage (CAES) system. Although low pressure CAES systems have similar efficiency to lead-acid batteries, they have much lower embodied energy due to their long life expectancy (decades instead of years).


Energy Source

Another way to lower the embodied energy is to switch to another renewable energy source. Solar PV power has high embodied energy compared to alternatives such as wind, water, or human power. These power sources could be harvested with little more than a generator and a voltage regulator – as the rest of the power plant could be built out of wood. Furthermore, a water-powered website wouldn’t require high-tech energy storage. If you’re in a cold climate, you could even operate a website on the heat of a wood stove, using a thermo-electric generator.

Solar Tracker

People who have a good supply of wind or water power could build a system with lower embodied energy than ours. However, unless the author starts powering his website by hand or foot, we’re pretty much stuck with solar power. The biggest improvement we could make is to add a solar tracker that makes the panel follow the sun, which could increase electricity generation by as much as 30%, and allow us to obtain a better uptime with a smaller panel.

Let’s Scale Things Up!

A final way to improve the sustainability of our system would be to scale it up: run more websites on a server, and run more (and larger) servers on a solar PV system. This set-up would have much lower embodied energy than an oversized system for each website alone.


Illustration: Diego Marmolejo.

Solar Webhosting Company

If we were to fill the author’s balcony with solar panels and start a solar powered webhosting company, the embodied energy per unique visitor would decrease significantly. We would need only one server for multiple websites, and only one solar charge controller for multiple solar panels. Voltage conversion would be more energy efficient, and both solar and battery power could be shared by all websites, which brings economies of scale.

Of course, this is the very concept of the data center, and although we have no ambition to start such a business, others could take this idea forward: towards a data center that is run just as efficiently as any other data center today, but which is powered by renewables and goes off-line when the weather is bad.

Add More Websites

We found that the capacity of our server is large enough to host more websites, so we already took a small step towards economies of scale by moving the Spanish and French versions of Low-tech Magazine to the solar powered server (as well as some other translations).

Although this move will increase our operational energy use and potentially also our embodied energy use, we also eliminate other websites that are or were hosted elsewhere. We also have to keep in mind that the number of unique visitors to Low-tech Magazine may grow in the future, so we need to become more energy efficient just to maintain our environmental footprint.

Combine Server and Lighting

Another way to achieve economies of scale would give a whole new twist to the idea. The solar powered server is part of the author’s household, which is also partly powered by off-grid solar energy. We could test different sizes of batteries and solar panels – simply swapping components between solar installations.

When we were running the server on the 50 W panel, the author was running the lights in the living room on a 10W panel – and was often left sitting in the dark. When we were running the server on the 10 W panel, it was the other way around: there was more light in the household, at the expense of a lower server uptime.

If the weather gets bad, the author could decide not to use the lights and keep the server online – or the other way around

Let’s say we run both the lights and the server on one solar PV system. It would lower the embodied energy if both systems are considered, because only one solar charge controller would be needed. Furthermore, it could result in a much smaller battery and solar panel (compared to two separate systems), because if the weather gets bad, the author could decide not to use the lights and keep the server online – or the other way around. This flexibility is not available now, because the server is the only load and its power use cannot be easily manipulated.

Energy Use in the Network

As far as we know, ours is the first life cycle analysis of a website that runs entirely on renewable energy and includes the embodied energy of its power and energy storage infrastructure. However, this is not, of course, the total energy use associated with this website.


There’s also the operational and embodied energy of the network infrastructure (which includes our router, the internet backbone, and the mobile phone network), and the operational and embodied energy of the devices that our visitors use to access our website: smartphones, tablets, laptops, desktops. Some of these have low operational energy use, but they all have very limited lifespans and thus high embodied energy.

Energy use in the network is directly related to the bit rate of the data traffic that runs through it, so our lightweight website is just as efficient in the communication network as it is on our server. However, we have very little influence over which devices people use to access our website, and the direct advantage of our design is much smaller here than in the network. For example, our website has the potential to increase the life expectancy of computers, because it’s light enough to be accessed with very old machines. Unfortunately, our website alone will not make people use their computers for longer.

Both the network infrastructure and the end-use devices could be re-imagined along the lines of the solar powered website.

That said, both the network infrastructure and the end-use devices could be re-imagined along the lines of the solar powered website – downscaled and powered by renewable energy sources with limited energy storage. Parts of the network infrastructure could go off-line if the local weather is bad, and your e-mail may be temporarily stored in a rainstorm 3,000 km away. This type of network infrastructure actually exists in some countries, and those networks partly inspired this solar powered website. The end-use devices could have low energy use and long life expectancy.

Because the total energy use of the internet is usually measured to be roughly equally distributed over servers, network, and end-use devices (all including the manufacturing of the devices), we can make a rough estimate of the total energy use of this website throughout a re-imagined internet. For our original set-up with 95.2% uptime, this would be 87.6 kWh of primary energy, which corresponds to 9 litres of oil and 27 kg of CO2. The improvements we outlined earlier could bring these numbers further down, because in this calculation the whole internet is powered by oversized solar PV systems on balconies.
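The rough whole-internet estimate assumes the server’s share (operational plus embodied energy, about 29.2 kWh per year for the 95.2% uptime set-up) is one third of the total:

```python
server_share_kwh = 29.2   # operational + embodied, original set-up, per year

total_kwh = server_share_kwh * 3   # servers + network + end-use devices
litres = total_kwh / 10            # ~8.8 litres of oil (~9 in the text)
co2_kg = litres * 3                # ~26 kg of CO2 (~27 in the text)
```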

Kris De Decker, Roel Roscam Abbing, Marie Otsuka

Thanks to Kathy Vanhout, Adriana Parra and Gauthier Roussilhe.

Proofread by Alice Essam.

* Subscribe to our newsletter
* Support Low-tech Magazine via Paypal or Patreon.
* Buy the printed website.
* Read more about the solar powered website


[1] The storage capacity for our original set-up is an estimation. In reality, during this period we have run the solar powered server on a 24 Wh (3.7V, 6.6Ah) LiPo-battery, and placed a very old 84.4 watt-hour lead-acid battery in between the LiPo and the solar charge controller to make both systems compatible. The cut-off voltage of the lead-acid battery was set very high in summer (meaning that the system was running only on the LiPo) but lower in winter (so that part of the lead-acid battery provided a share of the energy storage). This complicated set-up was entirely due to the fact that we could only measure the storage capacity of the LiPo battery, which we needed to display our online battery meter. In November 2019 we developed our own lead-acid battery meter, which made it possible to eliminate the LiPo from our configuration.

[2] “Energy Analysis of Batteries in Photovoltaic systems. Part one (Performance and energy requirements)” and “Part two (Energy Return Factors and Overall Battery Efficiencies)“. Energy Conversion and Management 46, 2005

[3] Zhong, Shan, Pratiksha Rakhe, and Joshua M. Pearce. “Energy payback time of a solar photovoltaic powered waste plastic recyclebot system.” Recycling 2.2 (2017): 10.

[4] There is little useful research into the embodied energy of solar charge controllers. Most studies focus on large solar PV systems, in which the charge controller’s embodied energy is negligible. The most useful result we found was a value of 1 MJ/W, estimated over the size of the controller: Kim, Bunthern, et al. “Life cycle assessment for a solar energy system based on reuse components for developing countries.” Journal of cleaner production 208 (2019): 1459-1468. For a capacity of 120W, this comes down to 120 MJ or 33.33 kWh. For the life expectancy, we found values of 7 years and 12.5 years: same reference, and Kim, Bunthern, et al. “Second life of power supply unit as charge controller in PV system and environmental benefit assessment.” IECON 2016-42nd Annual Conference of the IEEE Industrial Electronics Society. IEEE, 2016. We decided to make the calculation based on a life expectancy of 10 years.

[5] There is no research about the embodied energy of our server. We calculated the embodied energy on the basis of a life cycle analysis of a smartphone: Ercan, Mine & Malmodin, Jens & Bergmark, Pernilla & Kimfalk, Emma & Nilsson, Ellinor. (2016). “Life Cycle Assessment of a Smartphone”. DOI: 10.2991/ict4s-16.2016.15. We have no idea of the expected lifetime of the server, but since our Olimex is aimed at industrial use (unlike the Raspberry Pi), we assume a life expectancy of 10 years, just like the charge controller.

[6] De Decker, Kris. “How sustainable is solar PV power?”, Low-tech Magazine, May 2015.


Source: How Sustainable is a Solar Powered Website?

Social Media

Signs You’re Following A Fake Twitter Account

The challenge of dealing with fake accounts that spread disinformation is probably one of the greatest challenges facing the large social media companies.

Most methods used to detect fake Twitter accounts (like the excellent Botometer) assume that they are working at scale and are automated, and for large bot operations this is almost always the case. But what about fake accounts that have a real person behind them? Not every fake account is a bot, but there are still plenty of accounts that purport to be real people in order to add credibility to their message.

Earlier this week The Third Man asked about a Twitter account that he was slightly suspicious of. The account purports to be that of Max Steinberg, a Mancunian (i.e. from Manchester, UK) lawyer who now lives in Brooklyn, New York. He’s passionate about left-wing political causes and spends an awful lot of time amplifying content from his favourite political accounts – but is he even a real person? In the rest of the post I’ll look at a few indicators that suggest he probably isn’t.

Profile Picture – This Person Does Not Exist

This Person Does Not Exist is a website that uses AI to generate random but realistic looking faces. It’s a great tool and has become a popular way of generating fake profiles for sock puppet accounts, but it is not without its limitations. There are a number of common flaws and features in TPDNE-generated images that make it possible to spot them. I’m fairly confident that “Max Steinberg” has a few of these flaws too, so let’s see if we can prove this is the case.

Eye and Mouth Alignment

A common feature of TPDNE images is that the eyes and mouth of the person are always in exactly the same place in the picture. The eyes are always the same distance apart and centred in the same place. The mouth is always about one quarter of the way up from the bottom of the image and is also always centred. This occurs regardless of the angle of their head and can sometimes make for quite unusual looking faces. Let’s see if Max’s face follows the same pattern:

Interesting! By itself this may not be conclusive, but there are some other features that might indicate that Max’s origins are in an AI program, not Manchester.

Ear Asymmetry

We’re all unique and special snowflakes, and none of us have perfectly aligned facial features – but there’s still something odd about Max’s ears. In the photo he is staring at us directly but his ears are not even remotely symmetrical:

From this angle Max’s ears should look roughly the same size and shape but his right ear is a very different shape and size to his left one. Perhaps I’m being too harsh, and this is just a genetic trait – but if it is, then ear asymmetry is a genetic trait that a lot of the TPDNE family all have in common:

Hmmm. On the balance of probabilities I’m going to say that Max’s ears look like that because he’s AI generated. (I’m sure he will correct me if I’m wrong.)

Unusual Eyes And Teeth

There’s also something odd about his eyes. Look at his right pupil:

His left pupil looks normal, but the pupil in his right eye is nearly twice as wide. Another example of genetic bad luck? It seems unlikely. TPDNE has improved over time and it now renders eyes and mouths in a much more realistic way, but sometimes there are still tiny flaws that betray the origins of the image.

You can also see a similar problem with his teeth:

His two front teeth (red box) are different shapes and sizes. The teeth on the right side of his mouth have not been rendered properly at all. An analysis of Max’s facial features suggests he has more in common with an AI-generated fake than with a real person.

There are other giveaways too. TPDNE only creates a single image of a person, so if the person truly does not exist, we should never be able to find any image of them other than the fake one where they are staring directly at the camera. Sure enough, there are no other photos of Max in his Twitter account, no reverse image hits for his face, and no matches when searching for images of lawyers called Max Steinberg.

Twitter Habits

Max’s profile picture is likely fake, and it isn’t the only thing that suggests his account isn’t quite what it claims to be. There’s a lot of detail in his bio that we can try and verify. Let’s start with the location:

Although originally from Manchester, he claims to be based in Brooklyn, New York. If this is right then we should be able to see it reflected in Max’s Twitter activity. Analysing the account’s tweeting habits offers some insights that might tell us a little more about Max’s origins.

The first sign is the age of the account and the volume of activity. This is the account data as it appeared on the evening of 11th March:

The account is less than a week old but has already posted 766 tweets – that’s a huge volume in such a short period of time. It’s also evident from the content that this nearly-new Twitter account is dedicated entirely to amplifying political messages:

A new account with a fake profile picture that shares huge amounts of political content in a short period? Hmmm.

There’s also something odd about the times that Max tweets at. This is the overview of his activity times as of 10th March 2020:

These times are UTC, so this shows that Max usually tweets at either end of the working day (with a little activity at lunchtime) – but these times only make sense if Max is based in Europe (which varies between UTC and UTC +2). If Max really was based in New York (which is at UTC -4) then his Twitter activity would peak between 4am and 6am, and again between 1pm and 3pm, which would be unusual for someone who lived there. This is not conclusive, but it does indicate that Max is still probably a lot closer to Manchester than he is to New York.
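This timezone arithmetic is easy to check in Python with the standard zoneinfo module (a sketch; the 08:00 UTC timestamp is illustrative, not one of Max’s actual tweets):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

# A tweet timestamped 08:00 UTC on 10 March 2020...
utc_time = datetime(2020, 3, 10, 8, 0, tzinfo=timezone.utc)

# ...is morning in Western Europe, but the middle of the night in New York,
# which was already on daylight saving time (EDT, UTC-4) by that date.
london = utc_time.astimezone(ZoneInfo("Europe/London"))
new_york = utc_time.astimezone(ZoneInfo("America/New_York"))
print("London:", london.hour, "New York:", new_york.hour)
```

So a tweet sent at a perfectly normal 8am European time would appear at 4am for a genuine Brooklyn resident.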

As I was writing this up I noticed that in the 24 hours since I started there had been a big spike in Max’s activity late at night:

This seemed a little odd but a quick overview of his timeline provided some explanation. The spike coincided with the close of polls for the 10th March Democratic primaries, which prompted a flurry of activity from Max.

No Backstory

I’m still not convinced by Max’s bio details. Fortunately he’s a lawyer, so he’ll be listed in one of the public registries for qualified lawyers in either the UK or New York, right? In the UK, the Solicitors Regulation Authority maintains a list of registered lawyers. Here’s the data they held on Max:


So maybe he’s registered in New York? We can verify whether he is or not because fortunately New York maintains a public database of registered attorneys too. They have not one but two Max Steinbergs registered – perhaps I’ve judged Max a little too harshly?

Nope. The register also tells us that both Max Steinbergs suffer from a serious medical condition that means they can no longer practise law, let alone run a Twitter account.

We’ve seen that Max is almost certainly not a real person. He has a fake profile picture, a fake back story, probably doesn’t live where he says he does, and seems to have been created solely for the purpose of creating and amplifying political Twitter content.



Source: Signs You’re Following A Fake Twitter Account


20+ Machine Learning Datasets & Project Ideas

By Shivashish Thakur

To build a good model, you need a large amount of data. But finding the right dataset for your machine learning or data science project can be quite a challenging task. Fortunately, many organizations, researchers, and individuals have shared their work, and we can use their datasets to build our projects.

So in this article, we are going to discuss 20+ machine learning and data science datasets and project ideas that you can use for practicing and upgrading your skills.


1. Enron Email Dataset

The Enron dataset is popular in natural language processing. It contains more than 500K emails from over 150 users, most of whom are senior managers at Enron. The size of the data is around 432 MB.

Data Link: Enron email dataset

Project Idea: Using k-means clustering, you can build a model to detect fraudulent activities. K-means clustering is an unsupervised machine learning algorithm that separates the observations into k clusters based on similar patterns in the data.


2. Chatbot Intents Dataset

The dataset for a chatbot is a JSON file that has disparate tags like goodbye, greetings, pharmacy_search, hospital_search, etc. Every tag has a list of patterns that a user can ask, and the chatbot will respond according to that pattern. The dataset is perfect for understanding how chatbot data works.

Data Link: Intents JSON Dataset

Project Idea: You can build a chatbot, or understand how one works, by tweaking and expanding the data with your own observations. To build a chatbot of your own, you need a good knowledge of natural language processing concepts.
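As a minimal illustration of how such intent data drives a rule-based chatbot (the tags, patterns, and responses below are invented for the example, not taken from the actual JSON file):

```python
# Toy intents data, mimicking the tag / patterns / response structure described above.
intents = {
    "greetings": {"patterns": ["hi", "hello", "hey"],
                  "response": "Hello! How can I help you?"},
    "goodbye": {"patterns": ["bye", "goodbye", "farewell"],
                "response": "Goodbye, take care!"},
    "hospital_search": {"patterns": ["hospital", "clinic", "emergency"],
                        "response": "Searching for nearby hospitals..."},
}

def respond(message):
    """Pick the intent whose patterns overlap most with the message tokens."""
    tokens = set(message.lower().split())
    best_tag, best_overlap = None, 0
    for tag, data in intents.items():
        overlap = len(tokens & set(data["patterns"]))
        if overlap > best_overlap:
            best_tag, best_overlap = tag, overlap
    return intents[best_tag]["response"] if best_tag else "Sorry, I don't understand."

print(respond("hello there"))      # matches the greetings intent
print(respond("find a hospital"))  # matches the hospital_search intent
```

A real chatbot would replace the token-overlap matcher with an NLP model trained on the patterns, but the data flow is the same.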

Source Code: Chatbot Project in Python


3. Flickr 30k Dataset

The Flickr 30k dataset has over 30,000 images, and each image is labeled with different captions. The dataset is used to build image caption generators, and it is an upgraded version of Flickr 8k that enables more accurate models.

Data Link: Flickr image dataset

Project Idea: You can build a CNN model, which is great for analysing and extracting features from images, and use it to generate an English sentence that describes the image, called a caption.


4. Parkinson Dataset

Parkinson’s is a nervous system disorder that affects movement. The Parkinson dataset contains biomedical measurements: 195 records of people, each with 23 different attributes. The data is used to differentiate healthy people from people with Parkinson’s disease.

Data Link: Parkinson dataset

Project Idea: You can build a model that can be used to differentiate healthy people from people having Parkinson’s disease. The algorithm that is useful for this purpose is XGboost, which stands for extreme gradient boosting, and it is based on decision trees.
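A sketch of the approach, using scikit-learn’s GradientBoostingClassifier as a stand-in for XGBoost (the same gradient-boosted-trees idea), with synthetic data replacing the real biomedical measurements:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Parkinson data: 195 records, 22 numeric features.
X, y = make_classification(n_samples=195, n_features=22, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Gradient boosting builds an ensemble of shallow decision trees sequentially.
model = GradientBoostingClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```

With the real dataset you would load the 23-column CSV, drop the name column, and use the `status` column as the label.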

Source Code: ML Project on Detecting Parkinson’s Disease


5. Iris Dataset

The iris dataset is a beginner-friendly dataset with information about flower petal and sepal sizes. It has 3 classes with 50 instances in each class, so it contains only 150 rows with 4 columns.

Data Link: Iris dataset

Project Idea: Classification is the task of separating items into their corresponding class. You can implement a machine learning classification or regression model on the dataset.
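A minimal classifier on this dataset (scikit-learn ships the iris data, so the sketch is self-contained):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # 150 rows, 4 feature columns, 3 classes
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# A decision tree separates the three species from petal/sepal measurements.
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```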


6. ImageNet dataset

ImageNet is a large image database organized according to the WordNet hierarchy. It has over 100,000 phrases and an average of 1,000 images per phrase. The total size exceeds 150 GB. It is suitable for image recognition, face recognition, object detection, etc. It also hosts the challenging ILSVRC competition, which pushes people to build ever more accurate models.

Data Link: Imagenet Dataset

Project Idea: Implement image classification on this huge database and recognize objects. A CNN (convolutional neural network) model is necessary for this project to get accurate results.


7. Mall Customers Dataset

The Mall customers dataset holds details about people visiting the mall. It has fields for age, customer ID, gender, annual income, and spending score. From it you can gain insights and divide the customers into groups based on their behaviour.

Dataset Link: mall customers dataset

Project Idea: Segment the customers based on their gender, age, and interests. This is useful for customized marketing. Customer segmentation is the important practice of dividing customers into groups of similar individuals.

Source Code: Customer segmentation with Machine learning.


8. Google Trends Data Portal

Google Trends data can be examined and analyzed visually, and you can download a dataset as a CSV file with a simple click. It lets you find out what’s trending and what people are searching for.

Data Link: Google trends datasets


9. The Boston Housing Dataset

This is a popular dataset used in pattern recognition. It contains information about the different houses in Boston based on crime rate, tax, number of rooms, etc. It has 506 rows and 14 different variables in columns. You can use this dataset to predict house prices.

Data Link: Boston dataset

Project Idea: Predict the housing prices of a new house using linear regression. Linear regression is used to predict values of unknown input when the data has some linear relationship between input and output variables.
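A sketch of the idea with scikit-learn’s LinearRegression. Synthetic data stands in for the Boston features here (the invented relationship: price rises with room count and falls with crime rate), so the sketch is self-contained:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Synthetic stand-in for two of the Boston columns: rooms and crime rate.
rooms = rng.uniform(3, 9, 100)
crime = rng.uniform(0, 10, 100)
price = 50 + 25 * rooms - 3 * crime + rng.normal(0, 5, 100)  # known linear rule + noise

X = np.column_stack([rooms, crime])
model = LinearRegression().fit(X, price)

print("coefficients:", model.coef_, "intercept:", model.intercept_)
print("predicted price for 6 rooms, low crime:", model.predict([[6.0, 1.0]])[0])
```

The fitted coefficients recover the underlying relationship; with the real dataset you would fit on all 13 feature columns against the median house value.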


10. Uber Pickups Dataset

The dataset has information about 4.5 million Uber pickups in New York City from April 2014 to September 2014 and 14 million more from January 2015 to June 2015. Users can perform data analysis and gather insights from the data.

Data Link: Uber pickups dataset

Project Idea: To analyze the customer ride data and visualize it to find insights that can help improve business. Data analysis and visualization are an important part of data science; they are used to gather insights from the data, and visualization lets you take in that information quickly.


11. Recommender Systems Dataset

This is a portal to a collection of rich datasets that were used in lab research projects at UCSD. It contains various datasets from popular websites, such as Goodreads book reviews, Amazon product reviews, bartending data, and social media data, that can be used to build a recommender system.

Data Link: Recommender systems dataset

Project Idea: Build a product recommendation system like Amazon’s. A recommendation system can suggest products, movies, etc., based on your interests and the things you have liked and used earlier.

Source Code: Movie Recommendation System Project


12. UCI Spambase Dataset

Classifying emails as spam or non-spam is a very common and useful task. The dataset contains 4,601 emails described by 57 attributes. You can build models to filter out the spam.

Data Link: UCI spambase dataset

Project Idea: You can build a model that can identify your emails as spam or non-spam.
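A toy sketch of such a filter with a bag-of-words Naive Bayes model (the six training messages are invented for the example; the real Spambase data provides pre-computed word frequencies rather than raw text):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = [
    "win a free prize now", "free winner claim your prize", "you are a winner free entry",
    "meeting at noon tomorrow", "please review the project report", "lunch with the team today",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)   # word-count features per message
clf = MultinomialNB().fit(X, labels)

test = vectorizer.transform(["free prize for the winner"])
print(clf.predict(test)[0])
```

Multinomial Naive Bayes is a standard baseline for spam filtering because word counts map directly onto its probabilistic model.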


13. GTSRB (German traffic sign recognition benchmark) Dataset

The GTSRB dataset contains around 50,000 images of traffic signs belonging to 43 different classes and contains information on the bounding box of each sign. The dataset is used for multiclass classification.

Data Link: GTSRB dataset

Artificial Intelligence Project Idea: Build a model using a deep learning framework that classifies traffic signs and also recognizes their bounding boxes. Traffic sign classification is also useful in autonomous vehicles, which must identify signs and then take appropriate actions.

Source Code: Traffic Signs Recognition Python Project


14. Cityscapes Dataset

This is an open-source dataset for Computer Vision projects. It contains high-quality pixel-level annotations of video sequences taken in 50 different city streets. The dataset is useful in semantic segmentation and training deep neural networks to understand the urban scene.

Data Link: Cityscapes dataset

Project Idea: To perform image segmentation and detect different objects from a video on the road. Image segmentation is the process of digitally partitioning an image into various different categories like cars, buses, people, trees, roads, etc.


15. Kinetics Dataset

There are three Kinetics datasets: Kinetics 400, Kinetics 600, and Kinetics 700. Together they form a large-scale dataset of URL links to around 6.5 million high-quality videos.

Data Link: Kinetics dataset

Project Idea: Build a human action recognition model that detects which action a person is performing. Human action recognition works by classifying an action from a series of observations.


16. IMDB-Wiki dataset

The IMDB-Wiki dataset is one of the largest open-source datasets for face images with labeled gender and age. The images are collected from IMDB and Wikipedia. It has 5 million-plus labeled images.

Data Link: IMDB wiki dataset

Project Idea: Make a model that detects faces and predicts their gender and age. You can bucket the ages into ranges like 0-10, 10-20, 20-30, and so on.


17. Color Detection Dataset

The dataset contains a CSV file that has 865 color names with their corresponding RGB (red, green, and blue) values of the color. It also has the hexadecimal value of the color.

Data Link: Color Detection Dataset

Project Idea: The color dataset can be used to make a color detection app in which we have an interface to pick a color from an image, and the app displays the name of that color.
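The core of such an app is a nearest-color lookup: compute the distance from the picked pixel’s RGB value to every named color and return the closest match. A sketch, with a handful of hard-coded colors standing in for the 865-row CSV:

```python
# A few named colors standing in for the full 865-entry CSV.
COLORS = {
    "red": (255, 0, 0), "green": (0, 128, 0), "blue": (0, 0, 255),
    "black": (0, 0, 0), "white": (255, 255, 255), "yellow": (255, 255, 0),
}

def closest_color(rgb):
    """Return the named color with the smallest squared RGB distance."""
    return min(COLORS, key=lambda name: sum((a - b) ** 2
                                            for a, b in zip(COLORS[name], rgb)))

print(closest_color((250, 12, 10)))    # a reddish pixel
print(closest_color((240, 240, 245)))  # a near-white pixel
```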

Source Code: Color Detection Python Project


18. Urban Sound 8K dataset

The urban sound dataset contains 8732 urban sounds from 10 classes like an air conditioner, dog bark, drilling, siren, street music, etc. The dataset is popular for urban sound classification problems.

Data Link: Urban Sound 8K dataset

Project Idea: We can build a sound classification system to detect the type of urban sound playing in the background. This will help you get started with audio data and understand how to work with unstructured data.


19. Librispeech Dataset

This dataset contains a large amount of English speech derived from the LibriVox project: 1,000 hours of read English speech in various accents. It is used for speech recognition projects.

Data Link: Librispeech dataset

Project Idea: Build a speech recognition model to detect what is being said and convert it into text. The objective of speech recognition is to automatically identify what is being said in the audio.


20. Breast Histopathology Images Dataset

This dataset contains 277,524 images of size 50×50 extracted from 162 whole mount slide images of breast cancer specimens scanned at 40x. There are 198,738 IDC-negative samples and 78,786 IDC-positive ones.

Data Link: Breast histopathology dataset

Project Idea: To build a model that can classify breast cancer. You can build an image classification model with convolutional neural networks.

Source Code: Breast Cancer Classification Python Project


21. Youtube 8M Dataset

The YouTube 8M dataset is a large-scale labeled video dataset with 6.1 million YouTube video IDs, 350,000 hours of video, 2.6 billion audio/visual features, 3,862 classes, and an average of 3 labels per video. It is used for video classification.

Data Link: Youtube 8M

Project Idea: You can use the dataset for video classification, where the model describes what a video is about. The model takes a series of frames as input and classifies which category the video belongs to.



In this article, we saw more than 20 machine learning datasets that you can use to practice machine learning or data science. Creating a dataset of your own is expensive, so we can use other people’s datasets to get our work done. But we should read the dataset’s documentation carefully, because some datasets are free to use however you like, while others require you to credit the owner as stated by them.


Bio: Shivashish Thakur is an analyst and technical content writer. He is a technology enthusiast who loves to write about the latest cutting-edge technologies that are transforming the world. He is also a sports fan who loves to play and watch football.


Source: 20+ Machine Learning Datasets & Project Ideas


Free AI, ML, Deep Learning Video Lectures

List of Top Artificial Intelligence, Machine Learning, Deep Learning Video Lectures:


If you want to suggest any resource then please email us at


Source: Free AI, ML, Deep Learning Video Lectures

Machine Learning

Top 10 Data Science Algorithms You Must Know About

The implementation of Data Science to any problem requires a set of skills. Machine Learning is an integral part of this skill set. For doing Data Science, you must know the various Machine Learning algorithms used for solving different types of problems, as a single algorithm cannot be the best for all types of use cases. These algorithms find an application in various tasks like prediction, classification, clustering, etc from the dataset under consideration. In this article, we will see a brief introduction to the top Data Science algorithms.

Top Data Science Algorithms

The most popular Machine Learning algorithms used by the Data Scientists are:

1. Linear Regression

The linear regression method is used for predicting the value of the dependent variable from the values of the independent variable. The linear regression model is suitable for predicting the value of a continuous quantity.


The linear regression model represents the relationship between the input variables (x) and the output variable (y) of a dataset in terms of a line given by the equation,

y = b0 + b1x


  • y is the dependent variable whose value we want to predict.
  • x is the independent variable whose values are used for predicting the dependent variable.
  • b0 and b1 are constants in which b0 is the Y-intercept and b1 is the slope.

The main aim of this method is to find the values of b0 and b1 that give the best-fit line, that is, the line that covers or comes nearest to most of the data points.
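For this one-variable case, b0 and b1 have a closed-form least-squares solution, sketched here on a toy dataset that lies exactly on the line y = 1 + 2x:

```python
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]   # generated from y = 1 + 2x

x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)

# Least-squares estimates: b1 = cov(x, y) / var(x), b0 = y_mean - b1 * x_mean
b1 = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
      / sum((x - x_mean) ** 2 for x in xs))
b0 = y_mean - b1 * x_mean
print(f"y = {b0:.1f} + {b1:.1f}x")  # recovers y = 1.0 + 2.0x
```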

2. Logistic Regression

Linear regression represents relationships between continuous values, whereas logistic regression works on discrete values. Logistic regression finds its most common application in binary classification problems, that is, when there are only two possibilities for an event: either it occurs or it does not (0 or 1).

Thus, in Logistic Regression, we convert the predicted values into such values that lie in the range of 0 to 1 by using a non-linear transform function which is called a logistic function. The logistic function results in an S-shaped curve and is therefore also called a Sigmoid function given by the equation,

σ(x) = 1 / (1 + e^(-x))

The equation of Logistic Regression is,

P(x) = e^(b0 + b1x) / (1 + e^(b0 + b1x))

Where b0 and b1 are coefficients and the goal of Logistic Regression is to find the value of these coefficients.
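Note that the prediction equation is just the sigmoid applied to the linear combination b0 + b1x, which a few lines of Python make explicit (the b0 and b1 values below are illustrative, not fitted coefficients):

```python
import math

def sigmoid(x):
    """Logistic function: maps any real number into the range (0, 1)."""
    return 1 / (1 + math.exp(-x))

def predict(b0, b1, x):
    """P(x) = e^(b0 + b1x) / (1 + e^(b0 + b1x)), which equals sigmoid(b0 + b1x)."""
    return sigmoid(b0 + b1 * x)

print(sigmoid(0))         # 0.5, the midpoint of the S-curve
print(predict(-4, 2, 3))  # predicted probability for x = 3 with b0 = -4, b1 = 2
```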

3. Decision Trees

Decision trees help in solving both classification and prediction problems. They make the data easy to understand, which improves the accuracy of the predictions. Each node of a decision tree represents a feature or attribute, each link represents a decision, and each leaf node holds a class label, that is, the outcome.

The drawback of decision trees is that they suffer from overfitting. These two algorithms are most commonly used to build decision trees:

  • ID3 ( Iterative Dichotomiser 3) Algorithm

This algorithm uses entropy and information gain as the decision metric.

  • CART (Classification and Regression Tree) Algorithm

This algorithm uses the Gini index as the decision metric. The below image will help you to understand things better.

4. Naive Bayes

The Naive Bayes algorithm helps in building predictive models. We use this algorithm when we want to calculate the probability that an event will occur, given prior knowledge that another event has already occurred.

The Naive Bayes algorithm works on the assumption that each feature is independent and makes an individual contribution to the final prediction. Bayes’ theorem is represented by:

P(A|B) = P(B|A) P(A) / P(B)

Where A and B are two events.

  • P(A|B) is the posterior probability i.e the probability of A given that B has already occurred.
  • P(B|A) is the likelihood i.e the probability of B given that A has already occurred.
  • P(A) is the class prior probability.
  • P(B) is the predictor prior probability.
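A quick numeric illustration of the theorem, with made-up probabilities: suppose 20% of emails are spam, the word “free” appears in 60% of spam and in 5% of non-spam. The posterior probability that an email containing “free” is spam follows directly:

```python
p_spam = 0.20        # P(A): prior probability that an email is spam
p_free_spam = 0.60   # P(B|A): likelihood of the word "free" given spam
p_free_ham = 0.05    # likelihood of "free" given non-spam

# P(B): total probability of seeing "free" in any email
p_free = p_free_spam * p_spam + p_free_ham * (1 - p_spam)

# P(A|B) = P(B|A) * P(A) / P(B)
p_spam_given_free = p_free_spam * p_spam / p_free
print(f"P(spam | 'free') = {p_spam_given_free:.2f}")  # 0.12 / 0.16 = 0.75
```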

5. KNN

KNN stands for K-Nearest Neighbours. This algorithm can be applied to both classification and regression problems. The KNN algorithm uses the complete dataset as the training set. To predict the outcome of a new data point, it searches the entire dataset for the k most similar instances – the nearest neighbours – of that data point, and then predicts the outcome based on these k instances.

For finding the nearest neighbors of a data instance, we can use various distance measures like Euclidean distance, Hamming distance, etc. To better understand, let us consider the following example.

Here we have represented the two classes A and B by the circle and the square respectively. Let us assume the value of k is 3. Now we will first find three data points that are closest to the new data item and enclose them in a dotted circle. Here the three closest points of the new data item belong to class A. Thus, we can say that the new data point will also belong to class A.

You might be wondering how we chose k=3. Selecting the value of k is a critical task: it should be neither too small nor too large. A simple rule of thumb is to take k = √n, where n is the number of data points.
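A from-scratch sketch of the k = 3 prediction described above, with toy 2-D points standing in for classes A and B:

```python
import math
from collections import Counter

# Toy training data: class A clusters near the origin, class B near (5, 5).
train = [((0, 0), "A"), ((1, 0), "A"), ((0, 1), "A"),
         ((5, 5), "B"), ((6, 5), "B"), ((5, 6), "B")]

def knn_predict(point, k=3):
    """Find the k nearest training points by Euclidean distance, then majority-vote."""
    neighbours = sorted(train, key=lambda item: math.dist(point, item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

print(knn_predict((0.5, 0.5)))  # closest to the A cluster
print(knn_predict((5.5, 5.5)))  # closest to the B cluster
```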

Any doubts in TechVidvan’s Data Science algorithms article till now? Ask in the comment section.

6. Support Vector Machine (SVM)

Support Vector Machine, or SVM, comes under the category of supervised machine learning algorithms and finds application in both classification and regression problems. It is most commonly used for classification problems, where it separates the data points using a hyperplane.

The first step of this algorithm involves plotting all the data items as individual points in an n-dimensional graph, where n is the number of features and the value of each feature is the value of a specific coordinate. Then we find the hyperplane that best separates the two classes. Finding the correct hyperplane plays the most important role in classification. The data points which are closest to the separating hyperplane are the support vectors.

Let us consider the following example to understand how you can identify the right hyperplane. The basic principle for selecting the best hyperplane is that you have to choose the hyperplane that separates the two classes very well.

In this case, the hyperplane B is classifying the data points very well. Thus, B will be the right hyperplane.

All three hyperplanes are separating the two classes properly. In such cases, we have to select the hyperplane with the maximum margin. As we can see in the above image, the hyperplane B has the maximum margin, therefore it will be the right hyperplane.

In this case, the hyperplane B has the maximum margin but it is not classifying the two classes accurately. Thus, A will be the right hyperplane.

7. K-Means Clustering

K-means clustering is a type of unsupervised machine learning algorithm. Clustering means dividing the dataset into groups of similar data items, called clusters. K-means clustering categorizes the data items into k groups of similar items. To measure this similarity, we use the Euclidean distance, which is given by,

D = √((x1 - x2)^2 + (y1 - y2)^2)

K-means clustering is iterative in nature. The basic steps followed by the algorithm are as follows:

  • First, we select the value of k which is equal to the number of clusters into which we want to categorize our data. Then we assign the random center values to each of these k clusters. Now we start searching for the nearest data points to the cluster centers by using the Euclidean distance formula.
  • In the next step, we calculate the mean of the data points assigned to each cluster.
  • Again we search for the nearest data points to the newly created centers and assign them to their closest clusters.
  • We should keep repeating the above steps until there is no change in the data points assigned to the k clusters.

8. Principal Component Analysis (PCA)

PCA is basically a technique for performing dimensionality reduction of the datasets with the least effect on the variance of the datasets. This means removing the redundant features but keeping the important ones. To achieve this, PCA transforms the variables of the dataset into a new set of variables. This new set of variables represents the principal components. The most important features of these principal components are:

  • All the PCs are orthogonal (i.e they are at a right angle to each other).
  • They are created in such a way that with the increasing number of components, the amount of variation that it retains starts decreasing. This means the 1st principal component retains the variation to the maximum extent as compared to the original variables.

PCA is basically used for summarizing data. While dealing with a dataset there might be some features related to each other. PCA helps you to reduce such features and make predictions with fewer features without compromising accuracy. For example, consider the following diagram in which we have reduced a 3D space to a 2D space.
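That 3D-to-2D reduction can be sketched with NumPy’s eigendecomposition of the covariance matrix, on synthetic data whose third dimension is almost pure noise:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 3D data: two informative directions plus a near-zero-variance third.
X = np.column_stack([rng.normal(0, 5, 200),
                     rng.normal(0, 2, 200),
                     rng.normal(0, 0.1, 200)])

X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered.T)               # 3x3 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
order = eigvals.argsort()[::-1]          # sort descending by explained variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

X_2d = X_centered @ eigvecs[:, :2]       # project onto the top two components
print("explained variance ratio:", eigvals / eigvals.sum())
print("reduced shape:", X_2d.shape)
```

The first two components retain nearly all the variance, which is exactly why dropping the third dimension costs almost no accuracy.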

9. Neural Networks

Neural Networks are also known as Artificial Neural Networks. Let us understand this by an example.

Identifying the digits written in the above image is a very easy task for humans. This is because our brain contains millions of neurons that perform complex calculations for identifying any visual easily in no time. But for machines, this is a very difficult task to do.

Neural networks solve this problem by training the machine with a large number of examples. By this, the machine automatically learns from the data for recognizing various digits. Thus we can say that Neural Networks are the Data Science algorithms that work to make the machine identify the various patterns in the same way as a human brain does.

10. Random Forests

Random Forests overcome the overfitting problem of decision trees and help in solving both classification and regression problems. They work on the principle of ensemble learning: a large number of weak learners can work together to give highly accurate predictions.

Random Forests work in much the same way. A random forest considers the predictions of a large number of individual decision trees when giving the final outcome. It counts the votes for the predictions of the different decision trees, and the prediction with the largest number of votes becomes the prediction of the model. Let us understand this by an example.

In the above image, there are two classes labeled as A and B. In this random forest consisting of 7 decision trees, 3 have voted for class A and 4 for class B. As class B has received the most votes, the model’s prediction will be class B.
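The voting step itself is a one-liner; here is the 7-tree example above in code:

```python
from collections import Counter

# Predictions from the 7 individual decision trees in the forest.
tree_votes = ["A", "A", "A", "B", "B", "B", "B"]

def majority_vote(votes):
    """The class with the most votes becomes the forest's prediction."""
    return Counter(votes).most_common(1)[0][0]

print(majority_vote(tree_votes))  # B
```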


In this article, we have gone through a basic introduction to some of the most popular data science algorithms among data scientists. There are also various data science tools that help data scientists handle and analyze large amounts of data. Together, these tools and algorithms help them solve data science problems and make better strategies.

I hope you liked TechVidvan’s Data Science algorithms article, do give us a rating on Google.

Happy Learning!!


Source: Top 10 Data Science Algorithms You Must Know About


The Best 25 Datasets for Natural Language Processing

Natural language processing is a massive field of research. With so many areas to explore, it can sometimes be difficult to know where to begin – let alone start searching for data.

With this in mind, we’ve combed the web to create the ultimate collection of free online datasets for NLP. Although it’s impossible to cover every field of interest, we’ve done our best to compile datasets for a broad range of NLP research areas, from sentiment analysis to audio and voice recognition projects. Use it as a starting point for your experiments, or check out our specialized collections of datasets if you already have a project in mind.


Datasets for Sentiment Analysis

Where can I download datasets for sentiment analysis?

Machine learning models for sentiment analysis need to be trained with large, specialized datasets. The following list should hint at some of the ways that you can improve your sentiment analysis algorithm.

Multidomain Sentiment Analysis Dataset: This is a slightly older dataset that features a variety of product reviews taken from Amazon.

IMDB Reviews: Featuring 25,000 movie reviews, this relatively small dataset was compiled primarily for binary sentiment classification use cases.

Stanford Sentiment Treebank: Also built from movie reviews, Stanford’s dataset was designed to train a model to identify sentiment in longer phrases. It contains over 10,000 snippets taken from Rotten Tomatoes.

Sentiment140: This popular dataset contains 1.6 million tweets formatted with 6 fields: polarity, ID, tweet date, query, user, and the text. Emoticons have been pre-removed.
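Rows in this dataset are plain CSV with those six fields in order; a sketch of parsing one (the sample row below is invented in the dataset’s format, not a real tweet):

```python
import csv
import io

FIELDS = ["polarity", "id", "date", "query", "user", "text"]

# An invented row in the Sentiment140 layout (polarity 0 = negative, 4 = positive).
sample = '0,"1467810369","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","some_user","this flight was awful"'

row = next(csv.reader(io.StringIO(sample)))
tweet = dict(zip(FIELDS, row))
print(tweet["polarity"], tweet["user"], tweet["text"])
```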

Twitter US Airline Sentiment: Scraped in February 2015, these tweets about US airlines are classified as positive, negative, or neutral. Negative tweets have also been categorized by reason for complaint.


Datasets for Text

Where can I download text datasets for natural language processing?

Natural language processing is a massive field of research, but the following list includes a broad range of datasets for different natural language processing tasks, such as voice recognition and chatbots.

20 Newsgroups: This collection of approximately 20,000 documents covers 20 different newsgroups, from baseball to religion.

Reuters News Dataset: The documents in this dataset appeared on Reuters in 1987. They have since been assembled and indexed for use in machine learning.

The WikiQA Corpus: This corpus is a publicly-available collection of question and answer pairs. It was originally assembled for use in research on open-domain question answering.

UCI’s Spambase: Originally created by a team at Hewlett-Packard, this large spam email dataset is useful for developing personalized spam filters.

Yelp Reviews: This open dataset released by Yelp contains more than 5 million reviews.

WordNet: Compiled by researchers at Princeton University, WordNet is essentially a large lexical database of English ‘synsets’, or groups of synonyms that each describe a different, distinct concept.
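The synset idea described above can be illustrated with a toy, hand-made structure. The IDs and lemma groups below are illustrative, not real WordNet data; in practice WordNet is usually accessed through NLTK's wordnet corpus:

```python
# Toy illustration: each synset ID maps to a group of synonyms ("lemmas")
# that all describe one distinct concept. IDs mimic NLTK's naming scheme.
SYNSETS = {
    "car.n.01": {"car", "auto", "automobile", "machine"},   # a motor vehicle
    "car.n.02": {"car", "railcar", "railway car"},          # a railway carriage
}

def synsets_of(word):
    """Return the IDs of every synset (word sense) containing `word`."""
    return [sid for sid, lemmas in SYNSETS.items() if word in lemmas]

print(synsets_of("car"))   # a word can belong to several synsets
print(synsets_of("auto"))  # while a narrower synonym belongs to fewer
```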


Audio Speech Datasets for Natural Language Processing

Where can I download audio datasets for natural language processing? 

Audio speech datasets are useful for training natural language processing applications such as virtual assistants, in-car navigation, and any other sound-activated systems.

2000 HUB5 English: This dataset contains transcripts derived from 40 telephone conversations in English. The corresponding speech files are also available through this page.

LibriSpeech: This corpus contains roughly 1,000 hours of English speech, comprised of audiobooks read by multiple speakers. The data is organized by chapters of each book.

Spoken Wikipedia Corpora: Containing hundreds of hours of audio, this corpus is composed of spoken articles from Wikipedia in English, German, and Dutch. Due to the nature of the project, it also contains a diverse set of readers and topics.

Free Spoken Digit Dataset: This is a collection of 1,500 recordings of spoken digits in English.

TIMIT: This data is designed for research in acoustic-phonetic studies and the development of automatic speech recognition systems. It contains recordings of 630 speakers of American English reading ten ‘phonetically rich’ sentences.
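The Free Spoken Digit Dataset above stores its labels in the file names. Assuming the dataset's documented `{digit}_{speaker}_{index}.wav` naming convention (treat this as an assumption if your copy differs), a small sketch for recovering labels:

```python
import os

def label_from_filename(path):
    """Extract (digit, speaker) from an FSDD-style name like '7_jackson_32.wav'."""
    stem = os.path.splitext(os.path.basename(path))[0]
    digit, speaker, _index = stem.split("_")
    return int(digit), speaker

print(label_from_filename("recordings/7_jackson_32.wav"))  # (7, 'jackson')
```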


Datasets for Natural Language Processing (General)

Where can I download open datasets for natural language processing? 

Still can’t find what you need? Here are a few more datasets for natural language processing tasks.

Enron Dataset: Containing roughly 500,000 messages from the senior management of Enron, this dataset was made as a resource for those looking to improve or understand current email tools.

Amazon Reviews: This dataset contains around 35 million reviews from Amazon spanning a period of 18 years. It includes product and user information, ratings, and the plaintext review.

Google Books Ngrams: A Google Books corpus of n-grams, or ‘fixed size tuples of items’, can be found at this link. The ‘n’ in ‘n-grams’ specifies the number of words or characters in each tuple.
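The n-gram idea described above (‘fixed size tuples of items’) is simple to implement directly; a minimal sketch:

```python
def ngrams(tokens, n):
    """Return every run of n consecutive items as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "the quick brown fox".split()
print(ngrams(words, 2))  # bigrams: ('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')
```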

Blogger Corpus: Gathered from Blogger, this collection of 681,288 blog posts contains over 140 million words. Each blog included here contains at least 200 occurrences of common English words.

Wikipedia Links Data: Containing approximately 13 million documents, this dataset by Google consists of web pages that contain at least one hyperlink pointing to English Wikipedia. Each Wikipedia page is treated as an entity, while the anchor text of the link represents a mention of that entity.

Gutenberg eBooks List: This annotated list of ebooks from Project Gutenberg contains basic information about each eBook, organized by year.

Hansards Text Chunks of Canadian Parliament: This corpus contains 1.3 million pairs of aligned text chunks from the records of the 36th Canadian Parliament.

Jeopardy: The archive linked here contains more than 200,000 questions and answers from the quiz show Jeopardy. Each data point also contains a range of other information, including the category of the question, show number, and air date.

SMS Spam Collection in English: This dataset consists of 5,574 English SMS messages that have been tagged as either legitimate or spam. 425 of the texts are spam messages that were manually extracted from the Grumbletext website.
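The SMS Spam Collection above is distributed as plain text with one tab-separated `label<TAB>message` pair per line (treat that layout as an assumption if your copy differs). A stdlib-only loading sketch with a hypothetical sample:

```python
import collections
import io

# Hypothetical sample in the collection's label<TAB>message layout.
SAMPLE = "ham\tOk lar... Joking wif u oni\nspam\tWINNER!! Claim your prize now\n"

def load_sms(fp):
    """Yield (label, text) pairs from an SMS-Spam-Collection-style stream."""
    for line in fp:
        label, text = line.rstrip("\n").split("\t", 1)
        yield label, text

# Count how many messages carry each label.
counts = collections.Counter(label for label, _ in load_sms(io.StringIO(SAMPLE)))
print(counts)
```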


Still can’t find what you need? Lionbridge AI creates and annotates customized datasets for a wide variety of NLP projects, including everything from chatbot variations to entity annotation. With over 20 years of experience in managing a crowd of more than 500,000 linguistic specialists, Lionbridge AI is perfectly placed to provide your model with a solid foundation.

Source: The Best 25 Datasets for Natural Language Processing | Lionbridge AI


Advanced ML/DL/RL Theory Courses

Advanced ML/DL/RL: attempts at building a theory of DL, optimization theory, advanced applications, etc.

ML >>

ML >> Theory

DL >>

DL >> Theory

RL >>

Optimization >>

Applications >> Computer Vision

Applications >> Natural Language Processing

Applications >> 3D Graphics


MoVi: A Large Multipurpose Motion and Video Dataset: Model and Code

Human movements are both an area of intense study and the basis of many applications such as character animation. For many applications, it is crucial to identify movements from videos or analyze datasets of movements. Here we introduce a new human Motion and Video dataset, MoVi, which we make available publicly. It contains 60 female and 30 male actors performing a collection of 20 predefined everyday actions and sports movements, and one self-chosen movement. In five capture rounds, the same actors and movements were recorded using different hardware systems, including an optical motion capture system, video cameras, and inertial measurement units (IMU). For some of the capture rounds, the actors were recorded wearing natural clothing; for the other rounds, they wore minimal clothing. In total, our dataset contains 9 hours of motion capture data, 17 hours of video data from 4 different points of view (including one hand-held camera), and 6.6 hours of IMU data. In this paper, we describe how the dataset was collected and post-processed, and we present state-of-the-art estimates of skeletal motions and full-body shape deformations associated with skeletal motion. We discuss examples of potential studies this dataset could enable.

Source: MoVi: A Large Multipurpose Motion and Video Dataset: Model and Code


List of Machine Learning Resources for Beginner

  1. ML in the cloud training
    1. Google
      1. Google ML Crash Course
      2. Google AI Education
    2. Azure
      1. Machine learning crash course – Learn
      2. Intro to ML with Python and Azure Notebooks
      3. Build AI Solutions with Azure ML
      4. Explore AI solution development with data science services in Azure
    3. AWS
      1. AWS Learning Library
      2. Machine Learning Training on AWS
      3. Data Science Learning Path
  2. Websites and Resources
    1. Blogs & Social Media
      1. KD Nuggets
    2. General ML and Data Science
      1. Machine Learning Mastery
      2. Towards Data Science
      3. Machine Learning Websites
  3. Training Courses
    1. Stanford ML Course
    2. Coursera Training Resources
  4. Deep Learning
    1. Deep Learning Specialization with Andrew Ng
    2. Deep Learning Textbook
  5. ML Math
    1. Introduction to Linear Algebra
    2. Linear Algebra – Hefferon
    3. Deep Learning Math
    4. CS229 Notes on Linear Algebra
    5. AWS Math for Machine Learning
    6. Essential Math for Machine Learning – Python Edition
    7. Computational Linear Algebra
  6. ML Books
    1. Andrew Ng’s AI Transformation Playbook
    2. Machine Learning Yearning by Andrew Ng
  7. ML Programming
    1. Python for Data Science and Machine Learning Bootcamp
    2. Machine Learning, Data Science and Deep Learning with Python
    3. Machine Learning A-Z™: Hands-On Python & R In Data Science
    4. ML/DL Matlab eBook
    5. MIT Course
    6. Python for Data Science and Machine Learning Bootcamp
    7. Introduction to Pandas
    8. Pandas Data Structures
  8. Frameworks
    1. PyBrain
    2. PyML
    3. scikit-learn
    4. tensor-flow
    5. Tensor-flow Playground
    6. TensorFlow
    7. Keras
    8. MXNet
  9. Algorithms
    1. Algorithm Cheat Sheets
    2. Algorithmia
  10. Datasets

Trends in Machine Learning in 2020

Many industries realize the potential of Machine Learning and are incorporating it as a core technology. Progress and new applications of these tools are moving quickly in the field, and we discuss expected upcoming trends in Machine Learning for 2020.

By Tanya Singh.

To many, Machine Learning may be a new word, but it was first coined by Arthur Samuel in 1959, and since then, the constant evolution of Machine Learning has made it the go-to technology for many sectors. From robotic process automation to technical expertise, Machine Learning technology is extensively used to make predictions and gain valuable insight into business operations. It is considered a subset of Artificial Intelligence (intelligence demonstrated by machines).

If we go by the books, Machine Learning can be defined as a scientific study of statistical models and complex algorithms that primarily rely on patterns and inference. The technology works independently of any explicit instruction, and that’s its strength.

The impact of Machine Learning is striking: it has captured the attention of many companies, irrespective of their industry type, and has truly transformed the fundamentals of industries for the better.

The significance of Machine Learning is underscored by the fact that $28.5 billion was invested in this technology during the first quarter of 2019, as reported by Statista.

Taking the relevance of Machine Learning into account, we have come up with trends that are going to make their way into the market in 2020. The following are the much-anticipated Machine Learning trends that will alter the basis of industries across the globe.


1) Regulation of Digital Data

In today’s world, data is everything. The emergence of various technologies has fueled a surge of data. Whether in the automotive industry or the manufacturing sector, data is being generated at an unprecedented pace. But the question is, ‘is all the data relevant?’

Well, to untangle this mystery, Machine Learning can be deployed, as it can sort any amount of data across cloud solutions and data centers. It filters the data by significance, surfacing what is useful while leaving behind the scrap. This way, it saves time and helps organizations manage their expenditure, as well.

In 2020, an enormous amount of data will be produced, and industries will require Machine Learning to categorize the relevant data for better efficiency.


2) Machine Learning in Voice Assistance

A 2019 eMarketer study estimated that 111.8 million people in the US would use a voice assistant for various purposes. So it’s quite evident that voice assistants are a considerable part of industries. Siri, Cortana, Google Assistant, and Amazon Alexa are some of the in-demand examples of intelligent personal assistants.

Machine Learning, coupled with Artificial Intelligence, aids in processing operations with the utmost accuracy. Therefore, Machine Learning is going to help industries to perform complicated and significant tasks effortlessly while enhancing productivity.

It’s expected that in 2020, the growing areas of research and investment will mainly focus on producing custom-designed, Machine-Learning-powered voice assistants.


3) For Effective Marketing

Marketing is a vital factor for every business to survive in the prevailing cut-throat competition. It promotes the presence and visibility of a business while driving the intended results. But with the multitude of existing marketing platforms, it has become challenging even to establish a business’s presence.

However, if a business is successful enough to extract the patterns from the existing user data, then the business is very much expected to formulate successful and effective marketing strategies. And to analyze the data, Machine Learning can be deployed to mine data and evaluate research methods for more beneficial results.

Adoption of Machine Learning in defining effective marketing strategies is highly anticipated in the future course of time.


4) Advancement of Cyber Security

In recent times, cybersecurity has become the talk of the town. As reported by Panda Security, about 230,000 malware samples are created every day by hackers, and the intent behind the malware is always crystal clear. With so many computers, networks, programs, and data centers to defend, checking these malware attacks becomes even more difficult.

Thankfully, we have Machine Learning technology that aids the multiple layers of protection by automating complex tasks and detecting cyber-attacks on its own. Not only this, but Machine Learning can also be extended to react to cybersecurity breaches and mitigate the damage. It automates responses to cyber-attacks without the need for human intervention.

Going forward, Machine Learning will be used in advanced cyber defense programs to contain and minimize damage.


5) Faster Computing Power

Industry analysts have started grasping the power of artificial neural networks, because we can all foresee the algorithmic breakthroughs that problem-solving systems will require. Artificial Intelligence and Machine Learning can address complex issues that call for exploration and disciplined decision-making. And once all of that is deciphered, we can expect to experience blazing computing power.

Enterprises like Intel, Hailo, and Nvidia have already geared up to accelerate neural network processing via custom hardware chips and to improve the explainability of AI algorithms.

Once businesses figure out the computing capability needed to run Machine Learning algorithms at scale, we can expect to see more power players investing in hardware crafted for data sources at the edge.


The Endnote

Without reservation, we can say that Machine Learning is growing bigger by the day, and in 2020, we will be experiencing further applications of this innovative technology. And why not? With Machine Learning, industries can forecast demand and make quick decisions while riding on advanced Machine Learning solutions. Managing complex tasks while maintaining accuracy is the key to business success, and Machine Learning excels at exactly that.

All the Machine Learning trends mentioned above are quite practical and look promising for delivering unprecedented customer satisfaction. The dynamic dimensions of ever-growing industries further propel the relevance of these Machine Learning trends.

Source: Trends in Machine Learning in 2020